Dissertations / Theses on the topic 'Cache hierarchy'
Consult the top 32 dissertations / theses for your research on the topic 'Cache hierarchy.'
Huang, Cheng-Chieh. "Optimizing cache utilization in modern cache hierarchies." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/19571.
Settle, M. W. Alexander. "An adaptive chip multiprocessor cache hierarchy." Connect to online resource, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3256380.
Kurian, George Ph D. Massachusetts Institute of Technology. "Locality-aware cache hierarchy management for multicore processors." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/97806.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 185-194).
Next generation multicore processors and applications will operate on massive data with significant sharing. A major challenge in their implementation is the storage requirement for tracking the sharers of data. The bit overhead for such storage scales quadratically with the number of cores in conventional directory-based cache coherence protocols. Another major challenge is limited cache capacity and the data movement incurred by conventional cache hierarchy organizations when dealing with massive data scales. These two factors impact memory access latency and energy consumption adversely. This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., by improving utilization) and reduce data movement by exploiting locality and controlling replication. First, a limited directory-based protocol, ACKwise, is proposed to track the sharers of data in a cost-effective manner. ACKwise leverages broadcasts to implement scalable cache coherence. Broadcast support can be implemented in a 2-D mesh network by making simple changes to its routing policy without requiring any additional virtual channels. Second, a locality-aware replication scheme that better manages the private caches is proposed. This scheme controls replication based on data reuse information and seamlessly adapts between private and logically shared caching of on-chip data at the fine granularity of cache lines. A low-overhead runtime profiling capability to measure the locality of each cache line is built into hardware. Private caching is only allowed for data blocks with high spatio-temporal locality. Third, a timestamp-based memory ordering validation scheme is proposed that enables the locality-aware private cache replication scheme to be implementable in processors with out-of-order memory that employ popular memory consistency models. This method does not rely on cache coherence messages to detect speculation violations, and hence is applicable to the locality-aware protocol. The timestamp mechanism is efficient due to the observation that consistency violations only occur due to conflicting accesses that have temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps to be stored only for a small time window. Fourth, a locality-aware last-level cache (LLC) replication scheme that better manages the LLC is proposed. This scheme adapts replication at runtime based on fine-grained cache line reuse information and thereby balances data locality and off-chip miss rate for optimized execution. Finally, all the above schemes are combined to obtain a cache hierarchy replication scheme that provides optimal data locality and miss rates at all levels of the cache hierarchy. The design of this scheme is motivated by the experimental observation that both locality-aware private cache and LLC replication enable varying performance improvements across benchmarks. These techniques enable optimal use of the on-chip cache capacity, and provide low-latency, low-energy memory access, while retaining the convenience of shared memory and preserving the same memory consistency model. On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and energy by 22% over a state-of-the-art baseline while incurring a storage overhead of 30.7 KB per core (i.e., 10% of the aggregate cache capacity of each core).
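The limited-directory idea at the heart of ACKwise is compact enough to sketch in code. The C++ fragment below is an illustrative rendering only; the slot count K, the type names, and the broadcast fallback policy are assumptions for exposition, not the thesis's actual hardware design.

#include <array>
#include <cstdint>
#include <iostream>

// Minimal sketch of a limited-directory entry in the spirit of ACKwise(K):
// track up to K sharers exactly; beyond that, record only a sharer count and
// answer invalidations with a broadcast. All names are illustrative.
template <int K>
struct AckwiseEntry {
    std::array<int16_t, K> sharers{};  // exact sharer core IDs
    int count = 0;                     // total sharers (kept exact even in broadcast mode)
    bool broadcastMode = false;        // set once the K slots overflow

    void addSharer(int core) {
        if (!broadcastMode) {
            if (count < K) sharers[count] = static_cast<int16_t>(core);
            else broadcastMode = true;  // overflow: stop tracking identities
        }
        ++count;
    }
    // Invalidation cost in messages: targeted while exact, broadcast after.
    int invalidationMessages(int totalCores) const {
        return broadcastMode ? totalCores : count;
    }
};

int main() {
    AckwiseEntry<4> e;
    for (int core = 0; core < 6; ++core) e.addSharer(core);
    std::cout << "broadcast mode: " << e.broadcastMode
              << ", msgs on 64 cores: " << e.invalidationMessages(64) << '\n';
}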
George Kurian.
Ph. D.
Valls, Mompó Joan Josep. "Improving Energy and Area Scalability of the Cache Hierarchy in CMPs." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/79551.
As the number of cores increases in new generations of chip multiprocessors, CMPs must scale in performance, area, and energy consumption to meet the demands of a larger number of cores. Directory-based protocols are the most scalable alternative. A conventional directory, however, suffers from an inefficient use of storage and energy. First, the large, poorly scalable sharer vectors consume an unnecessary amount of leakage energy and area, especially considering that most blocks tracked by a directory are cached by a single core. Second, although increasing the directory size and associativity would raise system performance, it would come at a notable increase in energy consumption. This thesis studies the significant differences between the behavior of private and shared blocks in the directory, which leads us to a separate management for each type of block. We propose the PS-Directory, a two-level directory cache that keeps the few shared entries, which are the most frequently accessed, in a small first-level structure (namely, the Shared Directory Cache) and uses a larger and slower second-level structure (the Private Directory Cache) to hold the information of private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while reducing silicon area and energy consumption. Since the shared/private ratio of directory entries varies across applications and across execution phases within applications, we also propose the Dynamic Way Partitioning (DWP) Directory. The DWP-Directory reduces the number of ways that store shared entries and allows them to be powered on or off at runtime according to the dynamic requirements of the applications, following a repartitioning algorithm. Results show performance similar to a traditional highly associative directory and an area similar to other recent state-of-the-art schemes, with important reductions in both static and dynamic energy consumption. This dissertation also addresses the scalability problems found in cache memories. On a cache access, every way of the set is accessed in parallel, which is costly in energy. This thesis presents the PS-Cache architecture, an energy-efficient design that reduces the number of accessed ways without hurting performance. The PS-Cache uses the private/shared status of the referenced block to save energy, since only the subset of ways that hold blocks of the requested type is accessed. Results show important dynamic energy savings. Finally, we propose another energy-efficient architectural design that can be applied to any kind of set-associative cache memory. The proposal, the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, so that only a reduced number of ways is looked up in both the tag and data arrays. This allows our proposal to reduce the dynamic energy consumption of caches without hurting their access time. Experimental results show that this filtering mechanism brings the energy consumption of set-associative caches close to that of direct-mapped caches. Overall, the proposals presented in this thesis achieve a good trade-off among these three important design pillars: performance, area, and energy.
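The way-filtering mechanism of the PS-Cache is easy to illustrate. The following C++ sketch assumes a per-way private/shared bit and a classification supplied by the directory; both the structure and the names are hypothetical simplifications of the proposal.

#include <cstdint>
#include <iostream>
#include <vector>

// Sketch of a PS-Cache-style lookup: each way carries a private/shared bit and
// a lookup probes only the ways whose bit matches the classification of the
// requested block, saving tag- and data-array energy.
struct Way { uint64_t tag = 0; bool valid = false; bool shared = false; };

struct Set {
    std::vector<Way> ways;
    explicit Set(int assoc) : ways(assoc) {}

    // 'isShared' would come from the directory/OS classification of the block.
    bool lookup(uint64_t tag, bool isShared, int& waysProbed) {
        waysProbed = 0;
        for (const Way& w : ways) {
            if (w.valid && w.shared != isShared) continue;  // filtered: not probed
            ++waysProbed;                                   // energy spent on this way
            if (w.valid && w.tag == tag) return true;
        }
        return false;
    }
};

int main() {
    Set set(8);
    set.ways[0] = {0x2a, true, false};  // a private block
    set.ways[1] = {0x3b, true, true};   // a shared block
    int probed = 0;
    bool hit = set.lookup(0x2a, /*isShared=*/false, probed);
    std::cout << "hit=" << hit << ", ways probed=" << probed << '\n';
}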
Valls Mompó, JJ. (2017). Improving Energy and Area Scalability of the Cache Hierarchy in CMPs [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79551
TESIS
Xiang, Ping. "ANALYZING INSTRUCTION BASED CACHE REPLACEMENT POLICIES." Master's thesis, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2589.
M.S.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Engineering MSCpE
Ibrahim, Mohamed Assem Abd ElMohsen. "Rethinking Cache Hierarchy And Interconnect Design For Next-Generation Gpus." W&M ScholarWorks, 2021. https://scholarworks.wm.edu/etd/1627047836.
Dublish, Saumay Kumar. "Managing the memory hierarchy in GPUs." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31205.
SOHONI, SOHUM. "IMPROVING L2 CACHE PERFORMANCE THROUGH STREAM-DIRECTED OPTIMIZATIONS." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092932892.
Delgado, Nuno Miguel de Brito. "A system’s approach to cache hierarchy-aware decomposition of data-parallel computations." Master's thesis, Faculdade de Ciências e Tecnologia, 2014. http://hdl.handle.net/10362/13014.
The architecture of today's processors is very complex, comprising several computational cores and an intricate hierarchy of cache memories. The latter, in particular, differ considerably between the many processors currently available in the market, resulting in a wide variety of configurations. Application development is typically oblivious of this complexity and diversity, taking into consideration only the number of available execution cores. This oblivion prevents such applications from fully harnessing the computing power available in these architectures. This problem has been recognized by the community, which has proposed languages and models to express and tune applications according to the underlying machine's hierarchy. These, however, lack the desired abstraction level, forcing the programmer to have deep knowledge of computer architecture and parallel programming in order to ensure performance portability across a wide range of architectures. Realizing these limitations, the goal of this thesis is to delegate these hierarchy-aware optimizations to the runtime system. Accordingly, the programmer's responsibilities are confined to the definition of procedures for decomposing an application's domain into an arbitrary number of partitions. With this, the programmer has only to reason about the application's data representation and manipulation. We prototyped our proposal on top of a Java parallel programming framework, and evaluated it from a performance perspective against cache-neglectful domain decompositions. The results demonstrate that our optimizations deliver significant speedups against decomposition strategies based solely on the number of execution cores, without requiring the programmer to reason about the machine's hardware. These facts allow us to conclude that it is possible to obtain performance gains by transferring hierarchy-aware optimization concerns to the runtime system.
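The kind of hierarchy-aware sizing such a runtime could perform can be sketched briefly. The thesis prototypes its proposal on a Java framework; the C++ fragment below, with assumed cache parameters, only illustrates the general heuristic of splitting a domain until each partition's working set fits a core's share of cache.

#include <cstddef>
#include <iostream>

// Illustrative heuristic in the spirit of cache hierarchy-aware decomposition:
// instead of creating exactly one partition per core, keep splitting the domain
// until a partition's working set fits the cache level we block for. The cache
// size and bytes-per-element figures are assumptions for the example.
std::size_t choosePartitions(std::size_t elements, std::size_t bytesPerElem,
                             std::size_t cores, std::size_t cachePerCoreBytes) {
    std::size_t partitions = cores;  // baseline: one partition per core
    while ((elements * bytesPerElem) / partitions > cachePerCoreBytes)
        partitions *= 2;             // halve the working set per partition
    return partitions;
}

int main() {
    // 64M doubles, 8 cores, 1 MiB of effective cache per core (assumed figures).
    std::cout << choosePartitions(64u << 20, 8, 8, 1u << 20) << " partitions\n";
}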
Candel, Margaix Francisco. "Efficient L2 Cache Management to Boost GPGPU Performance." Doctoral thesis, Universitat Politècnica de València, 2019. http://hdl.handle.net/10251/125477.
In recent years, the growing need for computing capacity has become a challenge that has led the industry to look for alternative architectures to conventional out-of-order superscalar processors, with the goal of enabling an increase of computing power while achieving higher energy efficiency. GPU architectures, which just a decade ago were applied to accelerate computer graphics exclusively, have been one of the most employed alternatives for several years to reach the mentioned goal. A particular characteristic of GPUs is their high main memory bandwidth, which allows executing a large number of threads in a very efficient way. This feature, as well as their high computational power regarding floating-point operations, has caused the emergence of the GPGPU computing paradigm, where GPU architectures perform general purpose computations. The aforementioned characteristics make GPU devices very appropriate for the execution of massively parallel applications that have traditionally been executed in conventional high-performance processors. The work performed in this thesis aims to help improve the performance of GPUs in the execution of GPGPU applications. To this end, as a first step, a characterization study is carried out. In this study, the most important features of GPGPU applications, with respect to the memory hierarchy and its impact on performance, are identified. For this purpose, a detailed cycle-accurate simulator is used to model the architecture of a recent GPU. The study reveals that it is necessary to model with more detail some critical components of the GPU memory hierarchy in order to obtain accurate results. In addition, it shows that the achieved benefits can vary up to a factor of 3× depending on how these critical components are modeled. For this reason, as a second step before developing a novel proposal, the work in this thesis focuses on determining which components of the GPU memory hierarchy must be modeled with more detail to increase the accuracy of simulator results, and on improving the existing simulator models of these components. Moreover, a validation study is performed comparing the results obtained with the improved GPU models against those from a real commercial GPU. The implemented simulator improvements reduce the deviation of the results obtained with the simulator from results obtained with the real GPU by about 96%. Finally, once simulation accuracy is increased, this thesis proposes a novel approach, called FRC (Fetch and Replacement Cache), which highly improves the GPU computational power by enhancing main memory-level parallelism. The proposal increases the number of parallel accesses to main memory by accelerating the management of fetch and replacement actions corresponding to those cache accesses that miss in the cache. The FRC approach is based on a small auxiliary cache structure that efficiently unclogs the memory subsystem, enhancing the GPU performance up to 118% on average compared to the studied baseline. In addition, the FRC approach reduces the energy consumption of the memory hierarchy by about 57%.
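A heavily simplified C++ sketch of the FRC idea follows: an auxiliary pool of entries takes over the fetch/replacement bookkeeping of missing lines so more misses can be outstanding at once. Entry fields, sizes, and the stall policy are illustrative assumptions, not the dissertation's actual design.

#include <cstdint>
#include <optional>
#include <vector>

// A small pool of auxiliary entries handles in-flight fetches, so the regular
// cache ways are not tied up while memory responds and more misses can proceed
// in parallel. Purely illustrative structure.
struct FrcEntry { uint64_t addr; bool fetchIssued; };

class Frc {
    std::vector<std::optional<FrcEntry>> pool_;
public:
    explicit Frc(std::size_t entries) : pool_(entries) {}

    // On a cache miss: try to claim an FRC entry; if none is free, the miss
    // must stall (as a conventional cache would on a busy victim way).
    bool allocate(uint64_t lineAddr) {
        for (auto& slot : pool_)
            if (!slot) { slot = FrcEntry{lineAddr, true}; return true; }
        return false;  // pool exhausted: memory-level parallelism limit reached
    }
    // When the fill returns, data moves into a regular way and the entry frees up.
    void fillArrived(uint64_t lineAddr) {
        for (auto& slot : pool_)
            if (slot && slot->addr == lineAddr) { slot.reset(); return; }
    }
};

int main() {
    Frc frc(4);  // 4 outstanding misses max (assumed pool size)
    bool ok = frc.allocate(0x1000);
    frc.fillArrived(0x1000);
    return ok ? 0 : 1;
}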
Candel Margaix, F. (2019). Efficient L2 Cache Management to Boost GPGPU Performance [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/125477
TESIS
Davari, Mahdad. "Advances Towards Data-Race-Free Cache Coherence Through Data Classification." Doctoral thesis, Uppsala universitet, Avdelningen för datorteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320595.
Ghosh, Mrinmoy. "Microarchitectural techniques to reduce energy consumption in the memory hierarchy." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/28265.
Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Chatterjee, Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhakar.
Puche, Lara José. "Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/165254.
Current multicores face the challenge of sharing resources among the different processor cores. Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last level cache. Additionally, as the core count grows, the network on-chip is also becoming a potential performance bottleneck, since traditional designs may find scalability issues in the near future. Memory hierarchies communicated through fast interconnects are implemented in almost every current design, as they reduce the number of off-chip accesses and the overall latency. Main memory, caches, and interconnection resources, together with other widely-used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap. However, sharing these resources brings several concerns, one of the most challenging being the management of inter-application interference. Since almost every running application needs to access main memory, all of them are exposed to interference from other co-runners on their way to the memory controller. For this reason, making an efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain the performance in current and future designs. This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the Last Level Cache (LLC) and the Network on Chip (NoC). First, we study the scalability of both electrical and optical NoCs for future multicores and many-cores. To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework. A proper model is required; otherwise, important performance deviations may be observed in the evaluation results. The study reveals that, as the core count grows, the effect of distance on the end-to-end latency can negatively impact processor performance. In contrast, the study also shows that silicon nanophotonics are a viable solution to solve the mentioned latency problems. This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and lack of flexibility regarding the sharing of cache structures. These issues, which can be overcome in high performance processors by virtue of huge LLCs, can compromise performance in low power processors. To address these issues we propose a more efficient cache hierarchy organization that leverages optical interconnects. The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently. Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance results. Finally, we extend the proposal to support both sequential and parallel applications. This extension is required since the proposal adapts to the dynamic cache space needs of the running applications, and multithreaded applications' behaviors widely differ from those of single-threaded programs. In addition, coherence management is also addressed, which is challenging since each cache module can be assigned to any core at a given time in the proposed approach. For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings.
In summary, this thesis tackles major challenges arising from the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them. The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance, which results in better energy efficiency.
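A toy C++ sketch of the reconfigurable cache-module idea described above: a pool of modules that can be powered on and assigned to cores as demand changes. The demand metric and the one-module-per-step policy are assumptions made purely for illustration.

#include <cstddef>
#include <iostream>
#include <vector>

// A pool of independent cache modules that can be powered on or off and
// assigned to cores as their measured demand (here, miss rate) changes.
struct Module { bool powered = false; int owner = -1; };

void rebalance(std::vector<Module>& pool, const std::vector<double>& demand) {
    // Find the hungriest core, then power on one module for it per step.
    std::size_t hungriest = 0;
    for (std::size_t c = 1; c < demand.size(); ++c)
        if (demand[c] > demand[hungriest]) hungriest = c;
    for (auto& m : pool)
        if (!m.powered) {
            m.powered = true;
            m.owner = static_cast<int>(hungriest);
            break;  // power on a single module per rebalancing step
        }
}

int main() {
    std::vector<Module> pool(8);                       // 8 cache modules, all off
    std::vector<double> missRate = {0.02, 0.15, 0.05, 0.01};
    rebalance(pool, missRate);
    std::cout << "module 0 -> core " << pool[0].owner << '\n';  // core 1
}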
Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254
TESIS
Lodde, Mario. "Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35325.
Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
TESIS
Bangalore, Lakshminarayana Nagesh. "Efficient graph algorithm execution on data-parallel architectures." Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/53058.
Molka, Daniel. "Performance Analysis of Complex Shared Memory Systems." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-221729.
Rabbah, Rodric Michel. "Design Space Exploration and Optimization of Embedded Memory Systems." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/11605.
Belhadj, Amor Hela. "Hiérarchie mémoire dans les systèmes intégrés multiprocesseurs construits autour de réseaux sur puce." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM049/document.
Multi/many-core parallel systems for high-power computing at low energy costs are nowadays a reality. However, exploiting the performance of these architectures depends on the efficiency of the system in managing data accesses. The aim of our work is to improve the efficiency of these accesses by exploiting the characteristics of the hardware architecture. In the first part, we propose a new cache hierarchy organization that aims at maximizing the use of the available storage space at each level. This solution, based on non-uniform cache access (NUCA) architectures, supports both inter- and intra-level transfers within the hierarchy, and it requires a cache coherence protocol that suits these specifications. Obviously, the transfer of data in the hierarchy is also a determinant of system performance. In the second part, we therefore consider the specific communication needs of the protocol. We suggest the use of a virtualized network as an ad-hoc communication medium to manage coherence traffic at a lower cost; it links the caches of the same level to support intra-level transfers, which are a specificity of our protocol, in order to reduce the average access latency.
Lesage, Benjamin. "Architecture multi-coeurs et temps d'exécution au pire cas." Phd thesis, Université Rennes 1, 2013. http://tel.archives-ouvertes.fr/tel-00870971.
Burget, Jaroslav. "Informační systém pro vyhodnocování aktivit pracovníků ve stromové hierarchii v Caché." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2008. http://www.nusl.cz/ntk/nusl-235993.
Xu, Sanlin. "Mobility Metrics for Routing in MANETs." The Australian National University. Faculty of Engineering and Information Technology, 2007. http://thesis.anu.edu.au./public/adt-ANU20070621.212401.
Pottier, Loïc. "Co-scheduling for large-scale applications : memory and resilience." Thesis, Lyon, 2018. http://www.theses.fr/2018LYSEN039/document.
This thesis explores co-scheduling problems in the context of large-scale applications, with two main focuses: the memory side, in particular the cache memory, and the resilience side. With the recent advent of many-core architectures such as chip multiprocessors (CMP), the number of processing units is increasing. In this context, the benefits of co-scheduling techniques have been demonstrated. The main idea behind co-scheduling is to execute applications concurrently rather than in sequence in order to improve the global throughput of the platform. But sharing resources often generates interference. With the rising number of processing units accessing the same last-level cache, the interference among co-scheduled applications becomes critical. In addition, with that increasing number of processors, the probability of a failure increases too. Resilience aspects must be taken into account, especially for co-scheduling, because failure-prone resources might be shared between applications. On the memory side, we focus on the interference in the last-level cache; one solution used to reduce this interference is cache partitioning. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed. We also investigate the same problem on a real cache-partitioned chip multiprocessor, using the Cache Allocation Technology recently provided by Intel. Still on the memory side, we then study how to model and schedule task graphs on new many-core architectures, such as the Knights Landing architecture. These architectures offer a new level in the memory hierarchy through a new on-package high-bandwidth memory. Current approaches usually do not take this new memory level into account, yet new scheduling algorithms and data partitioning schemes are needed to take advantage of this deep memory hierarchy. On the resilience side, we explore the impact of failures on co-scheduling performance. The co-scheduling approach has been demonstrated in a fault-free context, but large-scale computer systems are confronted with frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications and significantly degrade performance. We aim at minimizing the expected completion time of a set of co-scheduled applications in a failure-prone context by redistributing processors.
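The flavor of cache partitioning studied here can be illustrated with a generic greedy way-allocation in the spirit of utility-based partitioning. The C++ sketch below uses toy miss-rate curves as inputs and is not claimed to be the thesis's exact strategy.

#include <iostream>
#include <vector>

// Generic greedy way-partitioning sketch: give each cache way to whichever
// co-scheduled application gains the most from one extra way, judging by its
// (assumed, profiled) misses-as-a-function-of-ways curve.
std::vector<int> partitionWays(const std::vector<std::vector<double>>& misses,
                               int totalWays) {
    std::vector<int> ways(misses.size(), 0);
    for (int w = 0; w < totalWays; ++w) {
        int best = 0; double bestGain = -1.0;
        for (std::size_t a = 0; a < misses.size(); ++a) {
            double gain = misses[a][ways[a]] - misses[a][ways[a] + 1];
            if (gain > bestGain) { bestGain = gain; best = static_cast<int>(a); }
        }
        ++ways[best];  // award the way with the largest marginal miss reduction
    }
    return ways;
}

int main() {
    // misses[a][k] = misses of application a when given k ways (toy numbers).
    std::vector<std::vector<double>> m = {{100, 60, 40, 35, 33, 32, 32, 32, 32},
                                          {100, 90, 80, 70, 60, 50, 40, 30, 20}};
    for (int w : partitionWays(m, 8)) std::cout << w << " ways\n";
}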
Kaeslin, Alain E. "Performance Optimisation of Discrete-Event Simulation Software on Multi-Core Computers." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-191132.
SIMLOX is a commercial software package developed by Systecon AB, whose main component is a discrete-event simulation core for analyzing maintenance solutions for complex technical systems. To handle large problems, the simulation is executed in parallel, which in theory should scale almost linearly with the number of threads. The performance improvement observed in practice, however, was very limited, which is why a thorough scalability analysis was carried out in this project. Using a low-overhead profiling tool and microarchitectural analysis, the causes could be identified: atomic operations creating substantial communication overhead, poor locality leading to fragmentation in physical address translation and poor use of the TLB cache, and certain CPU-intensive bottlenecks. Optimizations were then implemented and tested to avoid the identified problems. The tested solutions include elimination of expensive operations, more efficient memory management through scalable memory-allocation algorithms, and implementation of data structures with better locality and thus better use of the cache hierarchy. Verification on real test cases showed speedups of at least 6.75x on an eight-core processor, with most cases showing a speedup factor greater than 7.2. The optimizations also yielded a speedup factor of at least 1.5 for sequential single-threaded execution. The conclusion is that near-linear scaling with the number of cores is achievable for this type of discrete-event simulation.
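One of the optimization families mentioned above, removing contended atomic operations, can be shown in a few lines of C++. The pattern below (cache-line-padded per-thread counters aggregated on demand) is a standard technique given purely as an illustration; it is not code from SIMLOX. Compile as C++17.

#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Replace one heavily contended shared counter with padded per-thread
// counters that are only summed when the statistic is read.
struct alignas(64) PaddedCount { uint64_t v = 0; };  // padding avoids false sharing

int main() {
    const unsigned nThreads = 4;
    std::vector<PaddedCount> counts(nThreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&counts, t] {
            for (int i = 0; i < 1000000; ++i) ++counts[t].v;  // no contention
        });
    for (auto& w : workers) w.join();
    uint64_t total = 0;
    for (const auto& c : counts) total += c.v;  // aggregate once, at the end
    std::cout << total << '\n';                 // prints 4000000
}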
Péneau, Pierre-Yves. "Intégration de technologies de mémoires non volatiles émergentes dans la hiérarchie de caches pour améliorer l'efficacité énergétique." Thesis, Montpellier, 2018. http://www.theses.fr/2018MONTS108/document.
Today, intensive efforts to design energy-efficient and high-performance systems-on-chip (SoCs) are underway. The end of Moore's law in the early 21st century pushed designers to increase the number of cores per processor to continue improving performance. As a result, the silicon area occupied by cache memories has increased. Ever smaller technology nodes have also increased the leakage current of CMOS transistors. Thus, the energy consumption of memories represents an increasingly important part of the overall consumption of chips. To reduce this energy consumption, new memory technologies have emerged over the past decade: non-volatile memories (NVM). These memories have the particularity of a very low leakage current compared to conventional CMOS technologies. Hence, their use in an architecture would reduce the overall consumption of the cache hierarchy. However, these technologies suffer from higher access latencies than SRAM, higher access energy costs, and limited lifetime. Their integration into SoCs requires a continuous research effort. This thesis aims to evaluate the impact of a change of technology in the cache hierarchy. More specifically, we are interested in the last-level cache (LLC) and we consider the STT-MRAM technology. Our work adopts an architectural point of view in which a modification of the technology itself is not considered; instead, we try to integrate the different characteristics of STT-MRAM at the architectural level when designing the memory hierarchy. A first study set up an architectural exploration framework for systems containing emerging memories. A second study on architectural optimizations at the LLC was conducted to identify opportunities for the integration of STT-MRAM. The goal is to improve energy efficiency while reducing the access penalties due to the high latency of this technology.
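The SRAM-versus-STT-MRAM trade-off this thesis targets can be captured by a first-order energy model, E = reads x Eread + writes x Ewrite + Pleak x t. All per-access and leakage numbers in the C++ sketch below are illustrative assumptions, not measured values from the thesis.

#include <iostream>

// First-order energy model contrasting an SRAM LLC with an STT-MRAM LLC.
// The point: negligible leakage can let STT-MRAM win despite costly writes
// when the write fraction is modest.
struct Tech { double eRead, eWrite, pLeak; };  // nJ, nJ, W (assumed numbers)

double energyJ(const Tech& t, double reads, double writes, double seconds) {
    return (reads * t.eRead + writes * t.eWrite) * 1e-9 + t.pLeak * seconds;
}

int main() {
    Tech sram   {0.5, 0.5, 1.0};    // cheap accesses, high leakage (assumed)
    Tech sttmram{0.6, 2.5, 0.05};   // costly writes, tiny leakage (assumed)
    double reads = 5e8, writes = 1e8, t = 1.0;  // a one-second window
    std::cout << "SRAM:     " << energyJ(sram, reads, writes, t) << " J\n"    // 1.3 J
              << "STT-MRAM: " << energyJ(sttmram, reads, writes, t) << " J\n"; // 0.6 J
}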
Etcheverry, Arnaud. "Simulation de la dynamique des dislocations à très grande échelle." Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0263/document.
This research work focuses on improving the performance of 3D dislocation dynamics simulation so that it runs efficiently on modern computers. First, we introduce algorithmic techniques that reduce the complexity in order to target large-scale simulations. Second, we focus on data structures that take into account both the memory hierarchy and algorithmic data-access patterns. On one side, we build an adaptive data structure to handle the dynamism of the data; on the other side, we use an octree to combine hierarchical decomposition and data locality in order to face the intensive arithmetic of force-field computation and collision detection. Finally, we introduce the parallel aspects of our simulation. We propose a classical hybrid parallelism, with task-based OpenMP threads and domain decomposition techniques for MPI.
Kuo, Hsien-Kai, and 郭玹凱. "A Cache Hierarchy Aware Thread Mapping Methodology for Throughput Processors." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/47296669409808050941.
國立交通大學
電子工程學系 電子研究所
102
Deploying a cache system is an effective way to alleviate the memory bottleneck in modern throughput processors, such as GPGPUs. Recently proposed throughput-oriented architectures have added a multi-level hierarchy of shared cache to better exploit the data locality of general-purpose applications. The design philosophy of throughput processors allocates most of the chip area to processing cores, which results in a relatively small cache shared by a large number of cores compared with conventional multi-core CPUs. Applying a proper thread mapping scheme is crucial for gaining from constructive cache sharing and avoiding resource contention among thousands of threads. However, due to the significant differences in architectures and programming models, the existing thread mapping approaches for multi-core CPUs do not perform as effectively on throughput processors. In throughput processors, a commonly used thread mapping policy is to render the maximum possible thread-level parallelism (TLP). However, this greedy policy can cause serious cache contention in the shared last-level cache (SLLC) and significantly degrade system performance. Thread mapping in a throughput processor must therefore carefully trade off thread-level parallelism against cache contention. This dissertation first proposes a formal model that captures both the characteristics of threads and the cache sharing behavior of a multi-level shared cache. With appropriate proofs, the model forms a solid theoretical foundation beneath the proposed cache hierarchy-aware thread mapping methodology for multi-level shared-cache GPGPUs. The experiments reveal that the three-staged thread mapping methodology can successfully improve the data reuse on each cache level of GPGPUs and achieve an average of 2.3x to 4.3x runtime enhancement when compared with existing approaches. This dissertation then extends the discussion to further characterize and analyze the performance impact of cache contention in the SLLC of throughput processors. Based on the analyses and findings on cache contention and its performance pitfalls, this dissertation formally formulates the Aggregate-Working-Set-Size Constrained Thread Scheduling Problem, which constrains the aggregate working set size of concurrent threads. After proving the problem NP-hard, this dissertation integrates a series of algorithms to minimize cache contention and enhance the overall system performance of GPGPUs. The simulation results on NVIDIA's Fermi architecture show that the proposed thread scheduling scheme achieves up to 61.6% execution time enhancement over a widely-used thread clustering scheme. When compared to the state-of-the-art technique that exploits the data reuse of applications, the improvement in execution time can reach 47.4%. Notably, the execution time of the proposed thread scheduling scheme is within 2.6% of an exhaustive search scheme.
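The aggregate-working-set-size constraint can be illustrated with a simple first-fit batching sketch in C++. The dissertation integrates more sophisticated algorithms, so the code below (with toy working-set sizes) only demonstrates the constraint itself.

#include <cstddef>
#include <iostream>
#include <vector>

// First-fit sketch of aggregate-working-set-size constrained scheduling:
// group thread blocks into concurrently running batches so that the sum of
// their working sets stays within the shared LLC capacity, trading a little
// thread-level parallelism for less cache contention.
std::vector<std::vector<int>> makeBatches(const std::vector<std::size_t>& ws,
                                          std::size_t llcBytes) {
    std::vector<std::vector<int>> batches;
    std::vector<std::size_t> load;  // aggregate working set per batch
    for (int b = 0; b < static_cast<int>(ws.size()); ++b) {
        bool placed = false;
        for (std::size_t i = 0; i < batches.size() && !placed; ++i)
            if (load[i] + ws[b] <= llcBytes) {
                batches[i].push_back(b); load[i] += ws[b]; placed = true;
            }
        if (!placed) { batches.push_back({b}); load.push_back(ws[b]); }
    }
    return batches;  // batches run in turn; blocks within a batch run together
}

int main() {
    std::vector<std::size_t> ws = {300, 200, 600, 100, 500, 100};  // KiB (toy)
    auto batches = makeBatches(ws, 768);  // 768 KiB shared LLC (assumed)
    std::cout << batches.size() << " batches\n";  // 3 batches
}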
Hsu, Chia-Chang, and 許家彰. "The Design of a Fast Multi-Level Cache Hierarchy Simulator." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/77539040639092976867.
中原大學
資訊工程研究所
83
Trace-driven simulation has been widely accepted for cache performance analysis. However, the long execution time of the simulations is the major drawback of the trace-driven method. Several studies have been done to reduce the simulation time for single-level caches. Multi-level cache hierarchies are memory systems in which there are two or more levels of caching between the CPU and main memory. Many advanced commercial microprocessors devote some of their valuable on-chip area to caches, in which the capacity of the caches is very limited. Therefore, an off-chip cache on board is required to avoid frequent accesses to relatively slow main memory. The on-chip cache and the off-chip cache form a multi-level cache hierarchy located between the CPU and memory. The simulation of multi-level cache hierarchies is much more complicated than that of single-level caches, so it is highly desirable to develop efficient simulation methods for multi-level cache hierarchy performance analysis. In this thesis, a one-pass multi-level unified cache hierarchy simulation method is developed based on the stack model and inclusion properties. Recently, many processors have utilized split instruction and data caches in the first cache level, so a one-pass multi-level hybrid cache simulation method is also proposed for more generalized hierarchies. These two one-pass simulation methods are implemented, and the simulation results show their effectiveness.
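The stack-model foundation these one-pass methods build on is Mattson's classic stack algorithm: a single pass over the trace yields a stack-distance histogram, from which the hit ratio of every LRU cache size can be read off instead of simulating each configuration separately. A minimal C++ sketch:

#include <cstdint>
#include <iostream>
#include <list>
#include <vector>

// One-pass LRU stack simulation (Mattson's stack algorithm).
std::vector<std::size_t> stackDistances(const std::vector<uint64_t>& trace) {
    std::list<uint64_t> stack;                       // most recent at front
    std::vector<std::size_t> histogram;
    for (uint64_t block : trace) {
        std::size_t depth = 0; bool found = false;
        for (auto it = stack.begin(); it != stack.end(); ++it, ++depth)
            if (*it == block) { stack.erase(it); found = true; break; }
        if (found) {
            if (histogram.size() <= depth) histogram.resize(depth + 1, 0);
            ++histogram[depth];                      // hit at stack depth 'depth'
        }
        stack.push_front(block);                     // block becomes MRU
    }
    return histogram;  // hits(cache of C blocks) = sum of histogram[0..C-1]
}

int main() {
    auto h = stackDistances({1, 2, 3, 1, 2, 3, 4, 1});
    for (std::size_t d = 0; d < h.size(); ++d)
        std::cout << "distance " << d << ": " << h[d] << " hits\n";
}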
Huang, Annsi, and 黃安賜. "Hierarchy Proxy Cache Based On Different Level Load Sharing With Dynamic Hash." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/15670731107593626578.
元智大學
資訊工程學系
90
The Internet has grown rapidly with the widespread popularity of the World-Wide Web (WWW). Enormous amounts of duplicated web objects traverse the network over and over again, and as a consequence, user latency for accessing web resources has become unsatisfactory. Web proxy caching is one popular approach to reduce the user latency for accessing WWW resources. The dramatic increase in Internet users gives rise to network traffic overloading, and a proxy cache system must face such traffic spikes. An overloaded proxy system will create more network traffic than non-proxy-mode users would, because timed-out queries force users to try to connect to the original web servers again. In this thesis, we propose a load-sharing policy based on a hierarchical proxy cache structure. We take advantage of the hierarchy's characteristics and make use of the system resources more efficiently at the higher levels. Load sharing occurs at different levels, using a dynamic hash routing policy driven by the overloading status of the hierarchical proxy cache system. By doing so, network traffic spikes can be reduced, and the latency for proxy-mode users is kept low by our load-balancing mechanism.
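The dynamic-hash routing idea can be sketched as follows in C++: a URL normally maps to its home proxy, and overload triggers a deterministic rehash to a sibling. The probing order and the overload signal are assumptions for illustration, not the thesis's exact policy.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A URL normally maps to hash(url) % N; when that proxy reports overload the
// request is deterministically rehashed to the next candidate, spreading
// traffic spikes across sibling proxies.
int routeRequest(const std::string& url, const std::vector<bool>& overloaded) {
    const std::size_t n = overloaded.size();
    std::size_t h = std::hash<std::string>{}(url);
    for (std::size_t attempt = 0; attempt < n; ++attempt) {
        std::size_t proxy = (h + attempt) % n;   // probe the next candidate
        if (!overloaded[proxy]) return static_cast<int>(proxy);
    }
    return static_cast<int>(h % n);  // all overloaded: fall back to the home node
}

int main() {
    std::vector<bool> overloaded = {false, true, false, false};
    std::cout << "proxy " << routeRequest("http://example.com/a", overloaded) << '\n';
}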
Molka, Daniel. "Performance Analysis of Complex Shared Memory Systems." Doctoral thesis, 2016. https://tud.qucosa.de/id/qucosa%3A30223.
Khun, Jush Farshad. "Architectural enhancement for message passing interconnects." Thesis, 2008. http://hdl.handle.net/1828/1225.
Chang, Hua-long, and 張華隆. "The Use of Analytical Hierarchy Process for Teaching Evaluation Satisfaction - A Case Study of a Senior High School." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/47740028800296526070.
義守大學
工業工程與管理學系碩士班
97
Nowadays, the quality of teaching is an important issue for school competitiveness. This research uses the Analytic Hierarchy Process (AHP) to investigate satisfaction with the evaluation of senior high school teachers' performance. A questionnaire investigation is used to collect student ratings of instruction for the instructional evaluation. First, an expert questionnaire built with the AHP is filled in by teachers to obtain the relative weights that serve as the indicators of teaching performance. Then, a questionnaire is sent to students for review, and the resulting factors are summarized to evaluate the teachers' performance. The contributions of this research are: 1. The instructional evaluation can offer the school useful reference information. 2. The research makes the Analytic Hierarchy Process method more broadly applicable. 3. The instructional evaluation approach can be used in many other settings, such as assessment for transfer and promotion, personnel selection, and market research on product acceptance.
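The AHP mechanics this study relies on reduce to deriving criterion weights from a pairwise-comparison matrix, typically via its principal eigenvector. A minimal C++ sketch with assumed, illustrative judgments (not the thesis's data):

#include <iostream>
#include <vector>

// Approximate the principal eigenvector of a pairwise-comparison matrix by
// power iteration and normalize it into criterion weights.
std::vector<double> ahpWeights(const std::vector<std::vector<double>>& a) {
    const std::size_t n = a.size();
    std::vector<double> w(n, 1.0 / n), next(n);
    for (int iter = 0; iter < 100; ++iter) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            next[i] = 0.0;
            for (std::size_t j = 0; j < n; ++j) next[i] += a[i][j] * w[j];
            sum += next[i];
        }
        for (std::size_t i = 0; i < n; ++i) w[i] = next[i] / sum;  // normalize
    }
    return w;
}

int main() {
    // Three hypothetical criteria judged pairwise on Saaty's 1-9 scale.
    std::vector<std::vector<double>> a = {{1.0,   3.0, 5.0},
                                          {1/3.0, 1.0, 2.0},
                                          {1/5.0, 1/2.0, 1.0}};
    for (double w : ahpWeights(a)) std::cout << w << '\n';  // ~0.65, 0.23, 0.12
}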