
Dissertations / Theses on the topic 'On-chip memory'


Consult the top 50 dissertations / theses for your research on the topic 'On-chip memory.'


1

Lodde, Mario. "Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35325.

Abstract:
The cache hierarchy and the network-on-chip (NoC) are two key components of chip multiprocessors (CMPs). Most of the NoC traffic is due to messages that the caches send as dictated by the coherence protocol. The amount of traffic, the proportion of short and long messages, and the overall traffic pattern vary depending on the cache geometry and the coherence protocol. The NoC architecture and the cache hierarchy are in fact tightly coupled, and these two components should be designed and evaluated jointly to study how varying one affects the performance of the other; moreover, each component must adapt to the requirements and opportunities of the other, and vice versa. Typically, different message classes are sent over different virtual networks or over NoCs with different bandwidths, separating long and short messages. However, messages can also be classified by the kind of information they carry: some, such as data requests, need fields to store information (block address, request type, etc.); others, such as acknowledgments (ACKs), carry no information beyond the destination node ID. ACKs only convey timing information, in the sense that receiving an ACK indicates that the source node has received the message being acknowledged and completed all the operations required by the coherence protocol. This second class of messages does not need much bandwidth: latency matters far more, since the destination node is typically blocked waiting for them. This thesis develops a dedicated network to transmit this second class of messages; the network is very simple and fast, and delivers ACKs within a few clock cycles. By reducing the latency and NoC traffic due to ACKs, it is possible to: accelerate the invalidation phase of writes in a system using a directory-based coherence protocol; improve the performance of a broadcast-based coherence protocol up to a level comparable with a directory protocol, but without the area cost of storing the directory; and efficiently implement dynamic mapping of blocks to the last-level caches, with the goal of placing blocks as close as possible to the cores that use them. The final goal is a NoC and cache hierarchy co-design that minimizes the scalability problems of coherence protocols. The ultimate objective is a CMP with dynamic allocation of cache and network resources, such that these resources can be partitioned efficiently and independently in order to assign different partitions to different applications in a virtualized environment.

Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
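The split between payload-carrying messages and timing-only ACKs can be made concrete with a small behavioral model. The sketch below is illustrative only: the message types and latency constants are assumptions, not values from the thesis.

```python
# Behavioral sketch: route coherence messages onto a wide data NoC or a
# narrow, dedicated ACK network, and compare total delivery cycles for a
# write that invalidates three sharers. Latencies are assumed, not measured.

DATA_NOC_LATENCY = 12   # cycles per traversal of the packet-switched NoC
ACK_NET_LATENCY = 3     # cycles on the dedicated ACK network

def network_for(msg_type):
    """ACKs carry only a destination ID; everything else carries payload."""
    return "ack_net" if msg_type == "ACK" else "data_noc"

def total_cycles(trace):
    return sum(ACK_NET_LATENCY if network_for(m) == "ack_net"
               else DATA_NOC_LATENCY for m in trace)

trace = ["INV", "INV", "INV", "ACK", "ACK", "ACK"]   # invalidations + replies
baseline = len(trace) * DATA_NOC_LATENCY             # everything on one NoC
print(f"single NoC: {baseline} cycles, with ACK network: {total_cycles(trace)} cycles")
```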
2

Yang, Shufan. "Memory interconnect management on a chip multiprocessor." Thesis, University of Manchester, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.520682.

3

Cook, Henry Michael. "Productive Design of Extensible On-Chip Memory Hierarchies." Thesis, University of California, Berkeley, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10150942.

Abstract:
As Moore's Law slows and process scaling yields only small returns, computer architecture and design are poised to undergo a renaissance. This thesis brings the productivity of modern software tools to bear on the design of future energy-efficient hardware architectures.

In particular, it targets one of the most difficult design tasks in the hardware domain: coherent hierarchies of on-chip caches. I have extended the capabilities of Chisel, a new hardware description language, by providing libraries for hardware developers to use to describe the configuration and behavior of such memory hierarchies, with a focus on the cache coherence protocols that work behind the scenes to preserve their abstraction of global shared memory. I discuss how the methods I provide enable productive and extensible memory hierarchy design by separating the concerns of different hierarchy components, and I explain how this forms the basis for a generative approach to agile hardware design.

This thesis describes a general framework for context-dependent parameterization of any hardware generator, defines a specific set of Chisel libraries for generating extensible cache-coherent memory hierarchies, and provides a methodology for decomposing high-level descriptions of cache coherence protocols into controller-localized, object-oriented transactions.

This methodology has been used to generate the memory hierarchies of a lineage of RISC-V chips fabricated as part of the ASPIRE Lab's investigations into application-specific processor design.
4

Dimić, Vladimir. "Runtime-assisted optimizations in the on-chip memory hierarchy." Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/670363.

Abstract:
Following Moore's Law, the number of transistors on chip has been increasing exponentially, which has led to the increasing complexity of modern processors. As a result, the efficient programming of such systems has become more difficult. Many programming models have been developed to answer this issue. Of particular interest are task-based programming models that employ simple annotations to define parallel work in an application. The information available at the level of the runtime systems associated with these programming models offers great potential for improving hardware design. Moreover, due to technological limitations, Moore's Law is predicted to eventually come to an end, so novel paradigms are necessary to maintain the current performance improvement trends. The main goal of this thesis is to exploit the knowledge about a parallel application available at the runtime system level to improve the design of the on-chip memory hierarchy. The coupling of the runtime system and the microprocessor enables a better hardware design without hurting the programmability. The first contribution is a set of insertion policies for shared last-level caches that exploit information about tasks and task data dependencies. The intuition behind this proposal revolves around the observation that parallel threads exhibit different memory access patterns. Even within the same thread, accesses to different variables often follow distinct patterns. The proposed policies insert cache lines into different logical positions depending on the dependency type and task type to which the corresponding memory request belongs. The second proposal optimizes the execution of reductions, defined as a programming pattern that combines input data to form the resulting reduction variable. This is achieved with a runtime-assisted technique for performing reductions in the processor's cache hierarchy. The proposal's goal is to be a universally applicable solution regardless of the reduction variable type, size and access pattern. On the software level, the programming model is extended to let a programmer specify the reduction variables for tasks, as well as the desired cache level where a certain reduction will be performed. The source-to-source compiler and the runtime system are extended to translate and forward this information to the underlying hardware. On the hardware level, private and shared caches are equipped with functional units and the accompanying logic to perform reductions at the cache level. This design avoids unnecessary data movements to the core and back, as the data is operated on at the place where it resides. The third contribution is a runtime-assisted prioritization scheme for memory requests inside the on-chip memory hierarchy. The proposal is based on the notion of a critical path in the context of parallel codes and the known fact that accelerating critical tasks reduces the execution time of the whole application. In the context of this work, task criticality is observed at the level of a task type, as it enables simple annotation by the programmer. The acceleration of critical tasks is achieved by the prioritization of corresponding memory requests in the microprocessor.
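A policy of this kind can be sketched as a lookup from (task type, dependency type) to a logical insertion position in the LRU stack of a set. The mapping and the type names below are hypothetical illustrations, not the policies evaluated in the thesis.

```python
# Minimal sketch of a runtime-assisted insertion policy: the runtime tags each
# request with the task type and dependency type of the data it touches, and
# the LLC picks an LRU-stack insertion position accordingly. The position
# table and type names are hypothetical, not the policies from the thesis.

ASSOC = 16  # ways per set

# 0 = MRU (protect longest), ASSOC - 1 = LRU (evict soon); assumed mapping.
INSERT_POSITION = {
    ("critical", "input"): 0,               # reused soon by dependent tasks
    ("critical", "output"): 2,
    ("non_critical", "input"): 8,
    ("non_critical", "output"): ASSOC - 1,  # streamed out, unlikely reused
}

def insert_line(set_lines, new_line, task_type, dep_type):
    """Evict the LRU line and insert new_line at a policy-chosen position."""
    pos = INSERT_POSITION.get((task_type, dep_type), ASSOC - 1)
    victim = set_lines.pop()          # list is ordered MRU (0) .. LRU (-1)
    set_lines.insert(pos, new_line)
    return victim

cache_set = [f"line{i}" for i in range(ASSOC)]
victim = insert_line(cache_set, "lineX", "non_critical", "output")
print(f"evicted {victim}; lineX sits at recency position {cache_set.index('lineX')}")
```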
5

Chen, Dongliang. "Intelligent Efficient On-Chip Memory for Mobile Video Streaming." Thesis, North Dakota State University, 2017. https://hdl.handle.net/10365/30199.

Abstract:
The growing popularity of powerful mobile devices such as smartphones and tablets has resulted in exponential growth in demand for video applications. User experience and battery life are both crucial topics in the advancement of these devices. However, due to the high power consumption of mobile video decoders, and especially their on-chip memories, short battery life is one of the biggest contributors to user dissatisfaction. Various mobile embedded memory techniques have been investigated to reduce power consumption and prolong battery life. Unfortunately, the existing hardware-level research suffers from high implementation complexity and large overhead. In this thesis, we focus on smart power-efficient memory design, considering both the user's viewing experience and low-power memory design. Our results show that up to 57.2% power saving is achieved in VCAS and 43.7% power saving is achieved in D-DASH, with negligible area cost.

Funding: National Science Foundation (U.S.), Grant CCF-1514780.
6

Pourbakhsh, Seyed Alireza. "Dummy TSV-Based Timing Optimization for 3D On-Chip Memory." Thesis, North Dakota State University, 2016. https://hdl.handle.net/10365/29093.

Abstract:
Design and fabrication of three-dimensional (3D) ICs is one of the newest and hottest trends in the semiconductor manufacturing industry. In 3D ICs, multiple 2D silicon dies are stacked vertically, and through-silicon vias (TSVs) are used to transfer power and signals between dies. The electrical characteristics of TSVs can be modeled with equivalent circuits consisting of passive elements. In this thesis, we use "dummy" TSVs as electrical delay units in 3D SRAMs. Our results show that dummy-TSV-based delay units are as effective as conventional delay cells in performance, increase the operational frequency of the SRAM by up to 110%, reduce silicon area usage by up to 88%, incur negligible power overhead, and improve robustness against supply voltage variation and fluctuation.
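Since the thesis models a TSV as an equivalent circuit of passive elements, a first-order feel for a dummy-TSV delay line follows from the Elmore RC approximation. The R and C values below are placeholder assumptions, not parameters extracted in the thesis.

```python
# First-order sketch: a dummy TSV modeled as a lumped RC stage, chained into
# a delay line. The Elmore delay of a ladder of n identical stages is
# sum_{k=1..n} k*R*C; the 50% step delay is roughly 0.69x that.
# R and C are placeholder assumptions, not the thesis's extracted values.

R_TSV = 50e-3   # ohms, assumed per-TSV resistance
C_TSV = 40e-15  # farads, assumed per-TSV capacitance

def dummy_tsv_delay(n_stages, r=R_TSV, c=C_TSV):
    """Approximate 50% propagation delay of an n-stage RC ladder."""
    elmore = sum(k * r * c for k in range(1, n_stages + 1))
    return 0.69 * elmore

for n in (1, 4, 16):
    print(f"{n:2d} dummy TSV stage(s) -> {dummy_tsv_delay(n) * 1e15:.1f} fs")
```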
7

KASAT, AMIT. "MEMORY SYNTHESIS FOR FPGA-BASED RECONFIGURABLE COMPUTERS." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin988222220.

8

Shalan, Mohamed A. "Dynamic memory management for embedded real-time multiprocessor system-on-a-chip." Diss., available online, Georgia Institute of Technology, 2003. http://etd.gatech.edu/theses/available/etd-11252003-131621/unrestricted/shalanmohameda200312.pdf.

Abstract:
Thesis (Ph. D.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2004. Vincent Mooney, Committee Chair; John Barry, Committee Member; James Hamblen, Committee Member; Karsten Schwan, Committee Member; Linda Wills, Committee Member. Includes bibliography.
9

Chen, Zhi. "Power-Efficient and Low-Latency Memory Access for CMP Systems with Heterogeneous Scratchpad On-Chip Memory." UKnowledge, 2013. http://uknowledge.uky.edu/ece_etds/25.

Abstract:
The gradually widening speed disparity between CPU and memory has become an overwhelming bottleneck for the development of Chip Multiprocessor (CMP) systems. In addition, increasing penalties caused by frequent on-chip memory accesses have raised critical challenges in delivering high memory access performance within tight power and latency budgets. To overcome the daunting memory wall and energy wall issues, this thesis proposes a new heterogeneous scratchpad memory architecture composed of SRAM, MRAM, and Z-RAM. Based on this architecture, we propose two algorithms, a dynamic programming algorithm and a genetic algorithm, to perform data allocation to the different memory units, thereby reducing memory access cost in terms of power consumption and latency. Extensive experiments demonstrate the merits of the heterogeneous scratchpad architecture over traditional pure memory systems and the effectiveness of the proposed algorithms.
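The allocation problem described here can be cast as a small knapsack-style dynamic program: each data object is placed in SRAM, MRAM or Z-RAM subject to per-memory capacities so that total access cost is minimized. Below is a minimal sketch with invented workload and cost numbers; the genetic-algorithm alternative from the thesis is not shown.

```python
# Sketch of the dynamic-programming allocator: place each data object into
# SRAM, MRAM or Z-RAM to minimize total access cost subject to capacities.
# Workload and cost numbers are invented illustrations.

from functools import lru_cache

OBJECTS = [(100, 10, 2), (40, 40, 1), (500, 5, 3), (20, 200, 2)]  # (reads, writes, size)
MEMS = [("SRAM", 1, 1, 4),   # (name, read_cost, write_cost, capacity)
        ("MRAM", 2, 10, 4),  # e.g. MRAM writes modeled as expensive
        ("ZRAM", 3, 4, 4)]

@lru_cache(maxsize=None)
def best(i, caps):
    """Minimum cost and placement for OBJECTS[i:] given remaining capacities."""
    if i == len(OBJECTS):
        return 0, ()
    reads, writes, size = OBJECTS[i]
    result = None
    for m, (name, rc, wc, _cap) in enumerate(MEMS):
        if caps[m] < size:
            continue  # this memory unit is too full
        new_caps = tuple(c - size if j == m else c for j, c in enumerate(caps))
        sub = best(i + 1, new_caps)
        if sub is None:
            continue  # no feasible placement downstream
        cost = reads * rc + writes * wc + sub[0]
        if result is None or cost < result[0]:
            result = (cost, (name,) + sub[1])
    return result

cost, plan = best(0, tuple(m[3] for m in MEMS))
print(f"min cost = {cost}, placement = {plan}")
```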
10

Bonatto, Alexsandro Cristóvão. "Controle adaptativo para acesso à memória compartilhada em sistemas em chip." Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/109193.

Abstract:
The number of Processing Elements (PEs) contained in a System-on-Chip (SoC) follows the growth of the number of transistors per chip. A SoC composed of multiple PEs, in applications such as multimedia, implements algorithms that handle large volumes of data and justify the use of an external memory with large capacity. External memory accesses are shared by multiple PEs, which adds challenges that deserve special attention because they constitute the bottleneck for performance and a relevant factor in power consumption. In the case where the PEs are microprocessors, this issue becomes even more evident, as the rate of increase in the speed of microprocessors exceeds that of DRAM; both increase exponentially, but the exponent for microprocessors is larger. This effect is called the "memory wall" and means that the processing bottleneck is related to the speed of data access. In this scenario, new access control strategies are needed to improve processing performance. Heterogeneous platforms for multimedia processing are formed by several PEs, and concurrent accesses to non-contiguous regions of a DRAM reduce bandwidth and increase data access latency, degrading processing performance. This thesis shows that significant improvements in computational efficiency can be obtained using a memory-centric design methodology, i.e., one oriented to the functional aspects of the DRAM, through a memory subsystem with adaptive management of the memory channel shared among multiple clients. The thesis presents the architecture of a memory controller with predictable behavior that evaluates, at run time, the worst-case execution of the transactions requested by the clients, using a delay-based model to bound the worst cases for the set of clients. The memory subsystem centralizes data communication and manages the accesses of the several PEs in the system so that communication is served according to the needs of each application. Three main contributions are presented: 1) a memory-centric design method for integrated systems, which orients the design toward the functional aspects of the shared memory; 2) a delay-based model to estimate the worst-case behavior of the system, in terms of response times and the minimum bandwidth allocated per client; 3) an adaptive arbiter for managing accesses to the external memory with guaranteed transaction execution deadlines.
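For a simple round-robin arbiter, a delay-based worst-case model of this style reduces to a per-client bound: before a client's transaction is served, every other client may be served once. The sketch below computes such bounds from assumed per-client service times; it illustrates the style of analysis, not the thesis's exact model.

```python
# Sketch of a delay-based worst-case model for a shared DRAM channel under
# round-robin arbitration. Service times and the channel bandwidth are
# illustrative assumptions, not the thesis's measured parameters.

CLIENTS = {          # worst-case service time per transaction, in cycles
    "cpu": 40,
    "gpu": 120,
    "video": 80,
    "dma": 60,
}
TOTAL_BW = 12.8      # GB/s of the external memory channel (assumed)

def worst_case_latency(name):
    """Response-time bound: every other client served once, then this one."""
    others = sum(t for n, t in CLIENTS.items() if n != name)
    return others + CLIENTS[name]

def min_bandwidth_share(name):
    """Guaranteed share if each arbitration round serves every client once."""
    return TOTAL_BW * CLIENTS[name] / sum(CLIENTS.values())

for n in CLIENTS:
    print(f"{n:5s}: latency <= {worst_case_latency(n)} cycles, "
          f"bandwidth >= {min_bandwidth_share(n):.2f} GB/s")
```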
11

Naeem, Abdul. "Architecture Support and Scalability Analysis of Memory Consistency Models in Network-on-Chip based Systems." Doctoral thesis, KTH, Elektroniksystem, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-117700.

Abstract:
Shared memory systems should support parallelization at the computation (multi-core), communication (Network-on-Chip, NoC) and memory architecture levels to exploit the potential performance benefits. These parallel systems supporting the shared memory abstraction, in both the general-purpose and application-specific domains, confront the critical issue of memory consistency. The memory consistency issue arises due to unconstrained memory operations, which lead to unexpected behavior of shared memory systems. Memory consistency models enforce ordering constraints on the memory operations to guarantee the expected behavior of the shared memory systems. The intuitive Sequential Consistency (SC) model enforces strict ordering constraints on the memory operations and does not take advantage of system optimizations in either hardware or software. Alternatively, relaxed memory consistency models relax the ordering constraints on the memory operations and exploit these optimizations to enhance system performance at reasonable cost. The purpose of this thesis is twofold. First, novel architecture support is provided for different memory consistency models, namely SC, Total Store Ordering (TSO), Partial Store Ordering (PSO), Weak Consistency (WC), Release Consistency (RC) and Protected Release Consistency (PRC), in NoC-based multi-core (McNoC) systems. The PRC model is proposed as an extension of the RC model that provides additional reordering and relaxation of memory operations. Second, a scalability analysis of these memory consistency models is performed in McNoC systems. The architecture support for the different memory consistency models is provided in the McNoC platforms. Each configurable McNoC platform uses a packet-switched 2D-mesh NoC with a deflection routing policy, distributed shared memory (DSM), distributed locks and a customized processor interface. The memory consistency models/protocols are implemented in the customized processor interfaces, which are developed to integrate the processors with the rest of the system. The realization schemes for the memory consistency models are based on novel transaction-counter and address-stack approaches. The transaction counter is used in each node of the network to keep track of the outstanding memory operations issued by a processor in the system. The address stack is used in each node of the network to keep track of the addresses of the outstanding memory operations issued by a processor in the system. These hardware structures are used in the processor interface to enforce the required global orders under the different memory consistency models. The realization scheme of the PRC model additionally uses an acquire counter for further classification of data operations as unprotected or protected. The scalability analysis of these memory consistency models is performed on the basis of different workloads, which are developed and mapped onto various sized networks. The scalability study is conducted in McNoC systems with 1 to 64 cores, with various applications using different problem sizes and traffic patterns. Performance metrics like execution time, performance, speedup, overhead and efficiency are evaluated as a function of the network size.
The experiments are conducted with both synthetic and application workloads. The experimental results under different application workloads show that the average execution time under the relaxed memory consistency models decreases relative to the SC model. The specific numbers are highly sensitive to the application and depend on how well it matches the architecture. This study shows that the performance improvement of the relaxed memory consistency models over the SC model depends on the computation-to-communication ratio, traffic patterns, data-to-synchronization ratio and problem size. The performance improvement of the PRC and RC models over the SC model tends to exceed 50% in the experiments as the system is scaled up.
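The transaction-counter scheme can be mimicked in software: the counter in each node's processor interface tracks outstanding memory operations, and ordering points stall until it drains to zero. A minimal sketch with invented interface names:

```python
# Minimal sketch of the transaction-counter scheme: each node's processor
# interface counts outstanding memory operations and stalls at ordering
# points until the counter drains. Class and method names are invented.

class ProcessorInterface:
    def __init__(self, model="WC"):
        self.model = model        # "SC" or "WC" (weak consistency)
        self.outstanding = 0      # the transaction counter

    def issue(self, op):
        """Issue a memory operation into the NoC."""
        if self.model == "SC":
            self.wait_for_drain()  # SC: complete prior ops before issuing
        self.outstanding += 1
        print(f"issued {op}, outstanding = {self.outstanding}")

    def complete(self, op):
        """Reply/ACK for op returned from the network."""
        self.outstanding -= 1

    def fence(self):
        """WC: global order enforced only at synchronization points."""
        self.wait_for_drain()

    def wait_for_drain(self):
        # In hardware this is a pipeline stall; here we assert the invariant.
        assert self.outstanding == 0, "stall until outstanding ops complete"

pi = ProcessorInterface("WC")
pi.issue("ST A")
pi.issue("ST B")          # under WC, A and B may complete in any order
pi.complete("ST B")
pi.complete("ST A")
pi.fence()                # all earlier operations are now globally performed
```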
12

Simons, Brad. "Set-Associative History-Aided Adaptive Replacement for On-Chip Caches." Thesis, The University of Arizona, 2016. http://hdl.handle.net/10150/621128.

Abstract:
Last Level Caches (LLCs) are critical to reducing processor stalls on off-chip memory and improving processing throughput, and the replacement policy plays an important role in LLC performance. Many replacement algorithms are designed to be thrash-resistant in order to protect the working set in the cache from scans, but a fundamental challenge is balancing thrash-resistance against changes to the working set over time as an application executes. In this thesis a novel Set-Associative History-Aided Adaptive Replacement Cache (SHARC) LLC replacement algorithm is proposed, which adjusts scan-resistance at run-time based on the current memory access properties of the application. This policy segregates the cache to protect the working set from scans and utilizes history information from recently evicted cache lines to increase or decrease the amount of cache reserved for the working set. On average, SHARC improves IPC by approximately 11% over the LRU replacement policy while requiring only a 14% increase in overhead. The SHARC-NRU replacement policy is also proposed to reduce this overhead; it achieves approximately 10% performance improvement and requires 11% less overhead than LRU.
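The history-aided adaptation resembles keeping a small "ghost" list of recently evicted tags: a hit in that history suggests the protected partition was sized too small, so it grows, while history misses shrink it in favor of scans. The following is a schematic sketch of that adaptation loop, not the exact SHARC algorithm:

```python
# Schematic sketch of history-aided adaptive partitioning in one cache set:
# lines live in a protected (working-set) or probationary (scan) segment,
# and hits in a small history of evicted tags resize the protected segment.
# This illustrates the adaptation idea only, not the exact SHARC policy.

from collections import deque

ASSOC = 8

class AdaptiveSet:
    def __init__(self):
        self.protected = deque()            # working-set segment (MRU first)
        self.probation = deque()            # scan-resistant segment
        self.history = deque(maxlen=ASSOC)  # tags of recently evicted lines
        self.target = ASSOC // 2            # target size of protected segment

    def access(self, tag):
        if tag in self.protected or tag in self.probation:
            self._promote(tag)
            return "hit"
        if tag in self.history:             # evicted too early last time:
            self.target = min(ASSOC - 1, self.target + 1)  # grow protection
        else:
            self.target = max(1, self.target - 1)          # favor scans
        self._insert(tag)
        return "miss"

    def _promote(self, tag):
        for seg in (self.protected, self.probation):
            if tag in seg:
                seg.remove(tag)
        self.protected.appendleft(tag)

    def _insert(self, tag):
        self.probation.appendleft(tag)
        while len(self.protected) + len(self.probation) > ASSOC:
            victim_seg = (self.protected
                          if len(self.protected) > self.target
                          else self.probation)
            self.history.append(victim_seg.pop())  # remember the evicted tag

s = AdaptiveSet()
for t in ["a", "b", "a", "scan1", "scan2", "a", "b"]:
    print(t, s.access(t))
```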
13

Omar, Omar Jaber. "An On-Chip Memory for Testing of High-Speed Mixed-Signal Circuits." Thesis, Linköpings universitet, Elektroniska komponenter, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-103800.

Abstract:
Mixed-signal processing systems, especially data converters, can be reliably tested at high frequencies using on-chip testing schemes based on memory. In this thesis, an on-chip testing strategy based on shift registers/memory (2 kbit) is proposed for digital-to-analog converters (DACs) operating at 5 GHz. The proposed design uses a word length of 8 bits in order to test the DAC at the high speed of 5 GHz. The testing strategy has been designed in a standard 65 nm CMOS technology with the additional requirement of a 1-V supply, and has been implemented using the Cadence IC design environment. An additional advantage of the proposed strategy is that it requires a low number of I/O pins and avoids a large number of high-speed I/O pads; it therefore also sidesteps the bandwidth limitation associated with I/O transmission paths. The on-chip tester based on memory contains no analog blocks and is implemented entirely in the digital domain. In the proposed design, a low external frequency of 1 MHz is used to load the data into the memory during write mode. During read mode, a frequency of 625 MHz is used to read the data from the memory, and a multiplexing system reuses the stored data to test the intended functionality and performance. In order to convert the parallel data into serial data at high frequency at the memory output, a serializer is used. Using clock frequencies of 1.25 GHz and 2.5 GHz, the serializer speeds the data up from the lower frequency of 625 MHz to the final frequency of 5 GHz in order to test the DAC at 5 GHz.
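The clocking arithmetic of the read path is worth making explicit: 8-bit words read at 625 MHz supply exactly the 5 Gb/s consumed by the serial output, and each 2:1 serializer stage halves the lane count while doubling the per-lane rate. A small sketch of the rate budget, using only the figures quoted above:

```python
# Rate budget of the tester's read path: 8-bit words read at 625 MHz carry
# exactly the 5 Gb/s consumed at the serial output, and each 2:1 serializer
# stage halves the lane count while doubling the per-lane rate.

WORD_BITS = 8
READ_CLK = 625e6      # Hz, parallel read clock of the on-chip memory
MEM_BITS = 2048       # 2 kbit of pattern storage

print(f"output rate = {WORD_BITS * READ_CLK / 1e9:.1f} Gb/s")   # 5.0 Gb/s

lanes, lane_rate = WORD_BITS, READ_CLK
for stage in range(3):                 # 8 -> 4 -> 2 -> 1 lanes
    lanes //= 2
    lane_rate *= 2
    print(f"after 2:1 stage {stage + 1}: {lanes} lane(s) "
          f"at {lane_rate / 1e9:.2f} Gb/s")

# Length of the stored pattern when replayed at the full serial rate.
print(f"pattern length = {MEM_BITS / (WORD_BITS * READ_CLK) * 1e9:.1f} ns")
```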
14

Strobel, Manuel [author], and Martin [academic supervisor] Radetzki. "Design-time system-on-chip memory optimization / Manuel Strobel ; supervisor: Martin Radetzki." Stuttgart : Universitätsbibliothek der Universität Stuttgart, 2020. http://d-nb.info/1215101880/34.

15

Kunz, Leonardo. "Memória transacional em hardware para sistemas embarcados multiprocessados conectados por redes-em-chip." Biblioteca Digital de Teses e Dissertações da UFRGS, 2010. http://hdl.handle.net/10183/28739.

Abstract:
Transactional Memory (TM) has emerged in recent years as a new solution for synchronization on shared-memory multiprocessor systems, allowing better exploitation of application parallelism by avoiding the inherent limitations of the lock mechanism. In this model, the programmer defines regions of code, called transactions, that execute atomically. The system tries to execute transactions concurrently, but in case of conflict on memory accesses it takes the appropriate measures to preserve atomicity and isolation, usually aborting and re-executing one of the transactions. One of the most accepted hardware transactional memory models is LogTM, implemented in this work in an embedded MPSoC that uses a NoC as the interconnection mechanism. The experiments compare this implementation with locks, considering performance and energy. Furthermore, this work shows that the time a transaction waits to restart after an abort (the backoff delay on abort) has a significant impact on performance and energy. An analysis of this impact is done using three backoff policies, and a novel mechanism based on a handshake between transactions, called Abort handshake, is proposed as a solution to this issue. The results of the experiments depend on the application and system configuration and show TM benefits over the lock mechanism in most cases, reaching reductions of up to 30% in execution time and up to 32% in energy consumption on low-contention workloads. An analysis of the impact of the backoff delay on abort on performance and energy then compares the three backoff policies against the Abort handshake mechanism. The proposed mechanism shows reductions of up to 20% in execution time and up to 53% in energy compared to the best backoff policy analyzed. For applications with a high degree of synchronization, TM shows reductions in execution time of up to 63% and energy savings of up to 71% compared to locks.
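The compared backoff policies can be sketched as simple functions from the abort count to a restart delay; the proposed Abort handshake removes the guess entirely by restarting the loser only when the conflicting transaction signals completion. The constants and policy shapes below are generic illustrations, not the thesis's parameters:

```python
# Generic shapes of backoff-on-abort policies for hardware TM. The constants
# are illustrative assumptions; the thesis's Abort handshake avoids the delay
# guess by restarting the aborted transaction when the winner commits.

import random

BASE = 8  # cycles, assumed base delay

def linear_backoff(aborts):
    return BASE * aborts

def exponential_backoff(aborts, cap=1024):
    return min(cap, BASE * (2 ** aborts))

def randomized_backoff(aborts, cap=1024):
    return random.randint(0, min(cap, BASE * (2 ** aborts)))

for n in range(1, 5):
    print(n, linear_backoff(n), exponential_backoff(n), randomized_backoff(n))
```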
16

ARORA, VIKRAM. "AN EFFICIENT BUILT-IN SELF-DIAGNOSTIC METHOD FOR NON-TRADITIONAL FAULTS OF EMBEDDED MEMORY ARRAYS." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1037998809.

17

Bhide, Kanchan P. "DESIGN ENHANCEMENT AND INTEGRATION OF A PROCESSOR-MEMORY INTERCONNECT NETWORK INTO A SINGLE-CHIP MULTIPROCESSOR ARCHITECTURE." UKnowledge, 2004. http://uknowledge.uky.edu/gradschool_theses/253.

Abstract:
This thesis involves the modeling, design, Hardware Description Language (HDL) design capture, synthesis, implementation, and HDL virtual-prototype simulation validation of an interconnect network for a Hybrid Data/Command Driven Computer Architecture (HDCA) system. The HDCA is a single-chip shared-memory multiprocessor architecture. Various candidate processor-memory interconnect topologies that may meet the requirements of the HDCA system are studied and evaluated for use within it. The crossbar network topology is determined to best meet the HDCA system requirements, and it is therefore used as the processor-memory interconnect network of the HDCA system. Design capture, synthesis, implementation and HDL simulation are done in VHDL using the Xilinx ISE 6.2.3i and ModelSim 5.7g CAD tools. The design is validated by individually testing it against some possible test cases, and then by integrating it into the HDCA system and validating it against two different applications. The inclusion of the crossbar switch in the HDCA architecture involved major modifications to the HDCA system and some minor changes in the design of the switch. Virtual-prototype testing of the HDCA executing applications over the crossbar interconnect revealed proper functioning of the interconnect and the HDCA. Inclusion of the interconnect now allows the HDCA to implement dynamic node-level reconfigurability and multiple forking functionality.
18

Kwon, Woo Cheol. "Co-design of on-chip caches and networks for scalable shared-memory many-core CMPs." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/118084.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. Cataloged from the PDF version of the thesis. Includes bibliographical references (pages 169-180).

Chip Multi-Processors (CMPs) have become mainstream in recent years, providing increased parallelism as core counts scale. While a tiled CMP is widely accepted to be a scalable architecture for the many-core era, on-chip cache organization and coherence are far from solved problems. As the on-chip interconnect directly influences the latency and bandwidth of the on-chip cache, a scalable interconnect is an essential part of on-chip cache design. On the other hand, the optimal design of the interconnect is determined by the forms of traffic that it must handle. Thus, on-chip cache organization is inherently interleaved with on-chip interconnect design, and vice versa. This dissertation aims to motivate the need for re-organization of on-chip caches to leverage advances in on-chip network technology and harness the full potential of future many-core CMPs. Conversely, we argue that the on-chip network should also be designed to support the specific functionalities required by the on-chip cache. We propose such co-design techniques to offer significant improvements in on-chip cache performance, and thus to provide scalable CMP cache solutions for future many-core CMPs. The dissertation starts with the problem of remote on-chip cache access latency. Prior locality-aware approaches fundamentally attempt to keep data as close as possible to the requesting cores. In this dissertation, we challenge this design approach by introducing a new cache organization that leverages a co-designed on-chip network allowing multi-hop single-cycle traversals. Next, the dissertation moves to cache coherence request ordering. Without built-in ordering capability within the interconnect, cache coherence protocols have to rely on external ordering points. This dissertation proposes a scalable ordered Network-on-Chip which supports the ordering of requests for snoopy cache coherence. Lastly, we describe the development of a 36-core research prototype chip demonstrating that the proposed Network-on-Chip enables shared-memory CMPs to be readily scalable to many-core platforms.
19

Akgul, Bilge Ebru Saglam. "The System-on-a-Chip Lock Cache." Diss., Georgia Institute of Technology, 2004. http://hdl.handle.net/1853/5253.

Abstract:
In this dissertation, we implement efficient lock-based synchronization through a novel, high-performance, simple and scalable hardware technique and associated software for a target shared-memory multiprocessor System-on-a-Chip (SoC). The custom hardware part of our solution is provided in the form of an intellectual property (IP) hardware unit which we call the SoC Lock Cache (SoCLC). SoCLC provides effective lock hand-off by reducing on-chip memory traffic and improving performance in terms of lock latency, lock delay and bandwidth consumption. The proposed solution is independent of the memory hierarchy, cache protocol and processor architectures used in the SoC, which enables easily applicable implementations of the SoCLC (e.g., as reconfigurable or partially/fully custom logic) and distinguishes SoCLC from previous approaches. Furthermore, the SoCLC mechanism has been extended to support priority inheritance with an immediate priority ceiling protocol (IPCP) implemented in hardware, which enhances the hard real-time performance of the system. Our experimental results on a four-processor SoC indicate that SoCLC can achieve up to 37% overall speedup over spin-locks and up to 48% overall speedup over MCS for a microbenchmark with false sharing. The priority inheritance implemented as part of the SoCLC hardware achieves a 1.43x speedup in the overall execution time of a robot application when compared to the priority inheritance implementation under the Atalanta real-time operating system. Furthermore, with the IPCP mechanism integrated into the SoCLC, all of the tasks of the robot application could meet their deadlines (e.g., a high-priority task with a 250 µs worst-case response time completed in 93 µs with SoCLC, whereas the same task missed its deadline, completing in 283 µs, without SoCLC). Therefore, with IPCP support, our solution can provide better real-time guarantees for real-time systems. To automate SoCLC design, we have also developed an SoCLC generator tool, PARLAK, which generates user-specified configurations of a custom SoCLC. We used PARLAK to generate SoCLCs ranging from a version for two processors with 32 lock variables, occupying 2,520 gates, up to a version for fourteen processors with 256 lock variables, occupying 78,240 gates.
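Functionally, a lock cache keeps lock state in a small dedicated unit so that acquiring and releasing locks generates no coherence traffic for the lock variables themselves. The behavioral sketch below captures the hand-off idea; the queuing and wake-up details are invented, not the SoCLC RTL:

```python
# Behavioral sketch of a lock cache: lock state lives in a dedicated unit,
# so acquiring or releasing a lock never bounces the lock variable through
# the memory hierarchy. Queue and wake-up details are invented, not SoCLC.

from collections import deque

class LockCache:
    def __init__(self, n_locks):
        self.owner = [None] * n_locks             # which core holds each lock
        self.waiters = [deque() for _ in range(n_locks)]

    def acquire(self, lock_id, core):
        """Returns True if granted immediately, else queues the core."""
        if self.owner[lock_id] is None:
            self.owner[lock_id] = core
            return True
        self.waiters[lock_id].append(core)        # core sleeps, no spin traffic
        return False

    def release(self, lock_id):
        """Hand the lock directly to the next waiter (lock hand-off)."""
        q = self.waiters[lock_id]
        self.owner[lock_id] = q.popleft() if q else None
        return self.owner[lock_id]                # core to wake, e.g. via interrupt

lc = LockCache(32)
print(lc.acquire(5, "core0"))   # True: granted immediately
print(lc.acquire(5, "core1"))   # False: queued, core1 waits without spinning
print(lc.release(5))            # core1 now owns lock 5
```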
20

Puche, Lara José. "Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/165254.

Abstract:
Current multicores face the challenge of sharing resources among the different processor cores. Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last level cache. Additionally, as the core count grows, the network on-chip is also becoming a potential performance bottleneck, since traditional designs may find scalability issues in the near future. Memory hierarchies communicated through fast interconnects are implemented in almost every current design, as they reduce the number of off-chip accesses and the overall latency. Main memory, caches, and interconnection resources, together with other widely used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap. However, sharing these resources brings several concerns, one of the most challenging being the management of inter-application interference. Since almost every running application needs to access main memory, all of them are exposed to interference from co-runners on their way to the memory controller. For this reason, making an efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain performance in current and future designs. This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the Last Level Cache (LLC) and the Network on Chip (NoC). First, we study the scalability of both electrical and optical NoCs for future multicores and many-cores. To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework; a proper model is required, since important performance deviations may otherwise be observed in the evaluation results. The study reveals that, as the core count grows, the effect of distance on the end-to-end latency can negatively impact processor performance. In contrast, the study also shows that silicon nanophotonics are a viable solution to the mentioned latency problems. This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and the lack of flexibility regarding the sharing of cache structures. These issues, which can be overcome in high-performance processors by virtue of huge LLCs, can compromise performance in low-power processors. To address these issues we propose a more efficient cache hierarchy organization that leverages optical interconnects. The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently. Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance results. Finally, we extend the proposal to support both sequential and parallel applications. This extension is required since the proposal adapts to the dynamic cache space needs of the running applications, and multithreaded applications' behaviors differ widely from those of single-threaded programs. In addition, coherence management is also addressed, which is challenging since each cache module can be assigned to any core at a given time in the proposed approach. For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings. In summary, this thesis tackles major challenges originated by the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them. The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance, which results in better energy efficiency.

Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254
APA, Harvard, Vancouver, ISO, and other styles
21

Dublish, Saumay Kumar. "Managing the memory hierarchy in GPUs." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31205.

Full text
Abstract:
Pervasive use of GPUs across multiple disciplines is a result of continuous adaptation of the GPU architectures to address the needs of upcoming application domains. One such vital improvement is the introduction of the on-chip cache hierarchy, used primarily to filter the high bandwidth demand to the off-chip memory. However, in contrast to traditional CPUs, the cache hierarchy in GPUs is presented with significantly different challenges such as cache thrashing and bandwidth bottlenecks, arising due to small caches and high levels of memory traffic. These challenges lead to severe congestion across the memory hierarchy, resulting in high memory access latencies. In memory-intensive applications, such high memory access latencies often get exposed and can no longer be hidden through multithreading, and therefore adversely impact system performance. In this thesis, we address the inefficiencies across the memory hierarchy in GPUs that lead to such high levels of congestion. We identify three major factors contributing to poor memory system performance: first, disproportionate and insufficient bandwidth resources in the cache hierarchy; second, poor cache management policies; and third, high levels of multithreading. In order to revitalize the memory hierarchy by addressing the above limitations, we propose a three-pronged approach. First, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs and identify the architectural parameters that are most critical in alleviating congestion. Subsequently, we explore the architectural design space to mitigate the bandwidth bottlenecks in a cost-effective manner. Second, we identify significant inter-core reuse in GPUs, presenting an opportunity to reuse data among the L1s. We exploit this reuse by connecting the L1 caches with a lightweight ring network to facilitate inter-core communication of shared data. We show that this technique reduces traffic to the L2 cache, freeing up the bandwidth for other accesses. Third, we present Poise, a machine learning approach to mitigate cache thrashing and bandwidth bottlenecks by altering the levels of multi-threading. Poise comprises a supervised learning model that is trained offline on a set of profiled kernels to make good warp scheduling decisions. Subsequently, a hardware inference engine is used to predict good warp scheduling decisions at runtime using the model learned during training. In summary, we address the problem of bandwidth bottlenecks across the memory hierarchy in GPUs by exploring how to best scale, supplement and utilize the existing bandwidth resources. These techniques provide an effective and comprehensive methodology to mitigate the bandwidth bottlenecks in the GPU memory hierarchy.
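The inter-core reuse mechanism described above lends itself to a small functional sketch (C; the structure names and sizes are illustrative assumptions, not taken from the thesis): on a local L1 miss, the request travels the ring of L1 caches, and a remote hit returns the line without consuming L2 bandwidth.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 16
#define L1_LINES  512   /* illustrative sizes, not the thesis's parameters */

typedef struct { uint64_t tag; bool valid; } L1Line;
static L1Line l1[NUM_CORES][L1_LINES];

static bool l1_lookup(int core, uint64_t addr) {
    L1Line *line = &l1[core][(addr / 64) % L1_LINES];
    return line->valid && line->tag == addr / 64;
}

/* On a local L1 miss, walk the ring of L1 caches; only if every
 * remote L1 also misses does the request go down to the L2. */
bool ring_probe(int requester, uint64_t addr) {
    for (int hop = 1; hop < NUM_CORES; hop++) {
        int neighbor = (requester + hop) % NUM_CORES;
        if (l1_lookup(neighbor, addr))
            return true;           /* remote hit: L2 bandwidth saved */
    }
    return false;                  /* all miss: fall through to L2 */
}
```

In hardware the probe is a lightweight ring message rather than a sequential loop, but the control flow is the same: only an all-miss outcome escalates to the L2.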
APA, Harvard, Vancouver, ISO, and other styles
22

Shiomi, Jun. "Performance Modeling and On-Chip Memory Structures for Minimum Energy Operation in Voltage-Scaled LSI Circuits." Kyoto University, 2017. http://hdl.handle.net/2433/228252.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Bonatto, Alexsandro Cristóvão. "Núcleos de interface de memória DDR SDRAM para sistemas-em-chip." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2009. http://hdl.handle.net/10183/17291.

Full text
Abstract:
Many integrated Systems-on-Chip (SoC) devices, especially those dedicated to multimedia applications, process large amounts of data stored in memories. The performance of the memory ports directly affects the performance of the system. Optimizing the use of data storage and reducing the cost and power consumption of electronic systems encourage the development of efficient architectures for memory controllers. This improvement must be achieved for both embedded and external memories. In systems for video processing, for example, large memory arrays are needed to store several video frames while compression algorithms search for redundancies. In the case of an FPGA implementation, it is possible to use the memory blocks available inside the FPGA, but these hold only a few megabytes of data. To increase data storage capacity it is necessary to use external memory devices, and a memory controller intellectual property (IP) core is required. Nevertheless, its development is a very complex task and it is not always possible to have a custom solution. Using FPGAs for system prototyping allows the developer to perform rapid integration of modules to exercise a hardware version. In this case, test is an important issue to be considered in a complex system design. High speed memory controllers are very sensitive to gate and routing delays, and the synthesis from a hardware description language (HDL) needs to be verified to comply with predefined timing specifications.
To overcome these problems, a DDR SDRAM controller IP was developed which integrates a BIST (Built-In Self-Test) function, where a memory test is used to check the correct functioning of the DDR controller.
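The abstract does not say which test the BIST applies; a March-style algorithm is a common choice for verifying a memory path end to end. A software sketch of March C- (C; the word-level interface is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/* March C- over a word-addressable region: each element is a pass over
 * the array in a fixed direction with read-expected/write-new pairs.
 * A mismatch exposes a datapath or addressing fault in the controller. */
int march_cm(volatile uint32_t *mem, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) mem[i] = 0x00000000u;   /* up(w0)    */
    for (i = 0; i < n; i++) {                       /* up(r0,w1) */
        if (mem[i] != 0x00000000u) return -1;
        mem[i] = 0xFFFFFFFFu;
    }
    for (i = 0; i < n; i++) {                       /* up(r1,w0) */
        if (mem[i] != 0xFFFFFFFFu) return -1;
        mem[i] = 0x00000000u;
    }
    for (i = n; i-- > 0; ) {                        /* down(r0,w1) */
        if (mem[i] != 0x00000000u) return -1;
        mem[i] = 0xFFFFFFFFu;
    }
    for (i = n; i-- > 0; ) {                        /* down(r1,w0) */
        if (mem[i] != 0xFFFFFFFFu) return -1;
        mem[i] = 0x00000000u;
    }
    for (i = 0; i < n; i++)                         /* final (r0) */
        if (mem[i] != 0x00000000u) return -1;
    return 0;  /* pass */
}
```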
APA, Harvard, Vancouver, ISO, and other styles
24

Gnawali, Krishna Prasad. "EMERGING MEMORY-BASED DESIGNS AND RESILIENCY TO RADIATION EFFECTS IN ICS." OpenSIUC, 2020. https://opensiuc.lib.siu.edu/dissertations/1863.

Full text
Abstract:
The performance of modern computing systems improves with technology scaling due to advancements in the semiconductor industry. However, power efficiency and reliability do not scale linearly with performance. High leakage and standby power in sub-100 nm technologies are critical challenges faced by circuit designers. Recent developments in device physics have shown that emerging non-volatile memories are very effective in reducing power dissipation because they eliminate standby power and exhibit almost zero leakage power. This dissertation studies the use of emerging non-volatile memory devices in designing circuit architectures that improve the power dissipation and performance of computing systems. More specifically, it proposes a novel spintronic Ternary Content Addressable Memory (TCAM) and a novel memristive TCAM with improved power and performance efficiency. Our experimental evaluation on a 45 nm technology for a 256-bit word-size spintronic TCAM at a supply voltage of 1 V with a sense margin of 50 mV shows that the delay is less than 200 ps and the per-bit search energy is approximately 3 fJ. The proposed spintronic TCAM consumes at least 30% less energy when compared to state-of-the-art TCAM designs. The search delay of the proposed 144-bit memristive TCAM at a supply voltage of 1 V and a sense margin of 140 mV is 175 ps, with a per-bit search energy of 1.2 fJ on a 45 nm technology. It is 1.12x faster and dissipates 67% less search energy per bit than the fastest existing 144-bit MTCAM design. Emerging non-volatile memories are well known for their ability to perform fast analog multiplication and addition when arranged in a crossbar fashion, and are especially suited for neural network applications. However, such systems require an on-chip implementation of the backpropagation algorithm to accommodate process variations. This dissertation studies the impact of process variation on training memristive neural network architectures. It proposes a low hardware overhead on-chip implementation of the backpropagation algorithm that effectively utilizes the very dense memristive crossbar array and is resilient to process variations. Another important issue that needs careful study due to shrinking technology nodes is the impact of space or terrestrial radiation on Integrated Circuits (ICs), because the probability of a high energy particle causing an error increases with a decrease in the threshold voltage and the noise margin. Moreover, single-event effects (SEEs) sensitivity depends on the set of input vectors used at the time of testing, due to logical masking. This dissertation analyzes the impact of the input test set on the cross section of the microprocessor and proposes a mechanism to derive a high-quality input test set using automatic test pattern generation (ATPG) for radiation testing of microprocessor arithmetic and logic units.
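The search operation that both proposed designs implement in hardware can be captured by a short functional model (C; purely illustrative, since the thesis realizes it with spintronic and memristive cells evaluated in parallel):

```c
#include <stdint.h>

/* Each TCAM entry stores a value and a care mask: a 0 mask bit marks a
 * don't-care position. Hardware evaluates all entries simultaneously,
 * which is how a search completes in a single short cycle; this loop
 * only models the match semantics. */
typedef struct { uint64_t value; uint64_t care; } TcamEntry;

int tcam_search(const TcamEntry *t, int n, uint64_t key) {
    for (int i = 0; i < n; i++)
        if (((key ^ t[i].value) & t[i].care) == 0)
            return i;          /* index of first (highest-priority) match */
    return -1;                 /* no match */
}
```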
APA, Harvard, Vancouver, ISO, and other styles
25

Sampaio, Felipe Martin. "Energy-efficient memory architecture design and management for parallel video coding." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2018. http://hdl.handle.net/10183/179534.

Full text
Abstract:
This Thesis presents the design of an energy-efficient hybrid scratchpad video memory architecture (called Hy-SVM) for parallel High-Efficiency Video Coding (HEVC). Video coding stands out as a highly complex part of video processing applications. The HEVC standard brought innovations that increase the memory requirements, mainly due to: (a) the novel coding structures, which aggravate the computational complexity by providing a wider range of possibilities to be analyzed; and (b) the high-level parallelism features provided by Tiles partitioning, which provides performance acceleration but, at the same time, poses hard challenges to the memory infrastructure. The main bottleneck in terms of external memory transmission and on-chip storage is the reference frames data, which consists of already coded (and reconstructed) entire frames that must be stored and intensively accessed during the encoding of future frames. Due to the large volume of data required to represent the reference frames, they are typically stored in the external memory (especially when high-definition videos are targeted). The proposed Hy-SVM architecture is inserted in a video coding system based on multiple-Tiles partitioning to enable parallel HEVC encoding: each Tile is assigned to a specific processing unit. The key ideas of Hy-SVM include: application-specific design and management; combined multiple levels of private and shared memories that jointly exploit intra-Tile and inter-Tiles data reuse; scratchpad memories (SPMs) as energy-efficient on-chip data storage; and hybrid memory (HyM) design combining SRAM and STT-RAM. We propose a design methodology for Hy-SVM that leverages application-specific properties to properly define the HyM parameters. In order to provide run-time adaptation (for both off- and on-chip parts), Hy-SVM integrates a memory management layer composed of: (1) overlap prediction, which identifies redundant memory access behavior by analyzing monitored past-frame encodings to increase inter-Tiles data reuse exploitation; (2) memory pressure management, which balances the Tiles-accumulated memory pressure to improve the usage of the external memory communication channel; and (3) a lifetime-aware data management scheme that relieves the STT-RAM SPMs of high bit-toggling write accesses, increasing the cells' lifetime and reducing the overheads related to the poor write characteristics of STT-RAM. Application-specific knowledge is exploited by inheriting HEVC properties and performing run-time monitoring of memory accesses. Such information is used to properly design the on-chip video memories, as well as serving as input parameters of the run-time memory management layer. Based on the run-time decisions of the proposed management strategies, Hy-SVM integrates distributed memory access management units (MAMUs) to control the access dynamics of the private and shared SPMs.
Additionally, adaptive power management units (APMUs) strongly reduce on-chip energy consumption thanks to accurate overlap prediction. The experimental results demonstrate the overall energy savings of Hy-SVM over related works under various HEVC encoding scenarios. Compared to traditional data reuse schemes, like Level-C, the combined intra-Tile and inter-Tiles data reuse provides 69%-79% energy reduction. Regarding related HEVC video memory architectures, the savings vary from 2.8% (worst case) to 67% (best case). From the external memory perspective, Hy-SVM improves data reuse (by also exploiting inter-Tiles data redundancy), resulting in 11%-71% reduced off-chip energy consumption. Additionally, our APMUs contribute by reducing the on-chip energy consumption of Hy-SVM by 56%-95% for the evaluated HEVC scenarios. Thus, compared to related works, Hy-SVM presents the lowest on-chip energy consumption. The memory pressure management scheme reduces the variations in memory bandwidth by 37%-83% when compared to traditional raster-scan processing for 4- and 16-core parallelized HEVC encoders. The lifetime-aware data management significantly extends the STT-RAM lifetime, achieving a normalized lifetime of 0.83 (near the optimal case). Moreover, the overhead of implementing our management units insignificantly affects the performance and energy efficiency of Hy-SVM.
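The abstract does not detail the lifetime-aware write scheme; one classical way to cut bit toggling on STT-RAM writes is data-inversion coding, sketched below (C; an assumed, generic mechanism, not necessarily Hy-SVM's):

```c
#include <stdint.h>

/* Data-inversion write: if writing 'data' over the stored word would
 * toggle more bits than writing its complement, store the complement
 * plus a flag bit instead, halving the worst-case toggle count. */
typedef struct { uint32_t word; uint8_t inverted; } SttWord;

static int popcount32(uint32_t x) {
    int c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

void stt_write(SttWord *cell, uint32_t data) {
    int toggles_plain = popcount32(cell->word ^ data);   /* store data  */
    int toggles_inv   = popcount32(cell->word ^ ~data);  /* store ~data */
    if (toggles_inv < toggles_plain) { cell->word = ~data; cell->inverted = 1; }
    else                             { cell->word =  data; cell->inverted = 0; }
}

uint32_t stt_read(const SttWord *cell) {
    return cell->inverted ? ~cell->word : cell->word;
}
```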
APA, Harvard, Vancouver, ISO, and other styles
26

Damasceno, Alexandro Lima. "O impacto da hierarquia de memória sobre a arquitetura IPNoSys." Universidade Federal Rural do Semi-Árido, 2016. http://bdtd.ufersa.edu.br:80/tede/handle/tede/654.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Over the years, with advances in technology, the search for improvements in the performance of computer systems has been notable. Computer systems have evolved both in processing capacity and in the complexity of the implemented architectures. In such systems the use of memories is crucial, since they are responsible for storing the data to be processed. In an ideal environment, memories would have unlimited storage capacity, instantaneous data access and an extremely low cost per bit. In real systems, however, memories do not exhibit these characteristics: storage capacity, speed and cost per bit are factors that grow in proportion to each other. One technique used to balance these factors and improve the performance of computer systems is the memory hierarchy. In the scenario of new technologies and proposals for new processor organizations, a model that has been adopted by computer system designers is the use of MPSoCs (multiprocessor systems-on-chip), which offer higher energy and computational efficiency. In this scenario with many processing elements, the use of networks-on-chip (NoC) is more efficient than the use of buses. An NoC consists of a set of routers and interconnected channels forming a switched network. The cores are connected to the network terminals and communication occurs through the exchange of packets. NoCs have traditionally been designed exclusively for communication in SoCs. However, one unconventional architecture project decided to integrate processing and communication in an NoC. This architecture is known as IPNoSys. The IPNoSys (Integrated Processing NoC System) architecture is an unconventional processor that uses networks-on-chip and implements processing and routing units to handle and process instructions. It takes advantage of the characteristics of NoCs, such as scalability and parallel communication, to efficiently execute programs that exploit thread-level parallelism. Currently, the IPNoSys architecture has four memories physically distributed at the corners of the network, which nevertheless form a unified address space. Each memory module is associated with an access unit in charge of managing it. Given the current memory organization of IPNoSys, this work develops a new memory hierarchy system for IPNoSys and investigates the possible impacts on performance and on the programming model.
APA, Harvard, Vancouver, ISO, and other styles
27

Sampaio, Felipe Martin. "Energy-efficient memory hierarchy for motion and disparity estimation in multiview video coding." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/71292.

Full text
Abstract:
This Master Thesis proposes a memory hierarchy for Motion and Disparity Estimation (ME/DE) centered on the encoding references, called Reference-Centered Data Reuse (RCDR), focusing on energy reduction in Multiview Video Coding (MVC). In MVC encoders the ME/DE represents more than 98% of the overall energy consumption. Moreover, of the overall ME/DE energy, up to 90% is related to memory issues and only 10% to effective computation. There are two items to be concerned with: (1) off-chip memory communication to fetch the reference samples (45%); and (2) on-chip memory to keep the search window samples stored and to send them to the ME/DE processing core (45%).
The main goal of this work is to jointly minimize the on-chip and off-chip energy consumption related to ME/DE in MVC. The memory hierarchy is composed of an on-chip video memory (which stores the entire search window), an on-chip memory power-gating control, and a partial results compressor. A search control unit is also proposed to exploit the search behavior and achieve further energy reduction. This work also aggregates a low-complexity reference frame compressor to the memory hierarchy. The experimental results show that the proposed system accomplishes the goal of jointly minimizing the on-chip and off-chip energies. The RCDR provides off-chip energy savings of up to 68% when compared to the traditional MB-centered approach of the state of the art. The partial results compressor reduces by 52% the off-chip memory communication needed to handle this RCDR penalty. When compared to techniques that do not access the entire search window, the proposed RCDR also achieves the best results in off-chip energy consumption, due to the regular access pattern that allows many DDR burst reads (30% less off-chip energy consumption). Besides, the reference frame compressor improves the off-chip memory communication savings by 2.6x, with negligible losses in MVC encoding performance. The on-chip video memory size required for the RCDR is up to 74% smaller than in MB-centered Level C approaches. On top of that, the power-gating control saves 82% of leakage energy. The dynamic energy is addressed by the candidate merging technique, with savings of more than 65%. Owing to the joint off-chip communication and on-chip storage energy savings, the proposed memory hierarchy system meets the MVC constraints for ME/DE processing.
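The Level C scheme mentioned above reuses the overlap between the search windows of horizontally adjacent macroblocks, fetching only the non-overlapping columns. A sketch of that buffer update (C; sizes are illustrative, not the thesis's configuration):

```c
#include <stdint.h>
#include <string.h>

#define SW_H 48   /* search window height (illustrative) */
#define SW_W 48   /* search window width                 */
#define MB   16   /* macroblock size: the window slides by MB columns */

/* Level-C style reuse: the on-chip buffer holds the current search
 * window; moving one macroblock to the right only requires fetching
 * MB new columns instead of the whole SW_H x SW_W area. */
static uint8_t window[SW_H][SW_W];

void slide_window(const uint8_t *frame, int frame_w, int y0, int x_new) {
    for (int r = 0; r < SW_H; r++)       /* shift retained columns left */
        memmove(&window[r][0], &window[r][MB], SW_W - MB);
    for (int r = 0; r < SW_H; r++)       /* fetch only the new columns  */
        memcpy(&window[r][SW_W - MB],
               &frame[(size_t)(y0 + r) * frame_w + x_new], MB);
}
```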
APA, Harvard, Vancouver, ISO, and other styles
28

Zaib, Muhammad Aurang [Verfasser], Andreas [Akademischer Betreuer] Herkersdorf, Jürgen [Gutachter] Becker, and Andreas [Gutachter] Herkersdorf. "Network on Chip Interface for Scalable Distributed Shared Memory Architectures / Muhammad Aurang Zaib ; Gutachter: Jürgen Becker, Andreas Herkersdorf ; Betreuer: Andreas Herkersdorf." München : Universitätsbibliothek der TU München, 2018. http://d-nb.info/1153882604/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Diokh, Thérèse. "Développement des technologies mémoires "back-end" résistives à base d'oxydes pour application dans des "Systems on Chip" avancés." Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENT048.

Full text
Abstract:
Oxide-based Resistive Random Access Memories (OxRRAM) are nowadays considered among the most promising solutions for future generations of low-cost embedded non-volatile memories. The advantages of these memories are scalability, low power consumption, high speed, compatibility with complementary metal oxide semiconductor (CMOS) technology, and ease of fabrication (the memory cell consisting of a Metal-Insulator-Metal (MIM) structure integrated in the back-end-of-line, plus an addressing element, i.e. a transistor or a diode). The potential applications range from consumer and communications to automotive and industrial. This work deals with the development of an OxRRAM demonstrator in an advanced CMOS technology for System on Chip (SoC) applications. We discuss the impact of different dielectric materials (Ta2O5, ZrO2 and HfO2) and electrodes (Pt, Ti, TiN) on the memory performance and reliability in order to choose the best dielectric/electrode couple. We focus on the understanding of the memory switching physics involved in the programming of OxRRAM bit-cells. The failure and transition mechanisms are presented for lifetime prediction. Methodologies are presented in this PhD thesis for the optimization of OxRRAM bit-cell performance and size according to a targeted Multiple-Time Programmable (MTP) memory application. We developed analog block systems to control and address the OxRRAM bit-cell, taking into account the bipolar switching characteristics of the devices. Finally, these solutions are validated using a 1-kb OxRRAM demonstrator designed and fabricated in a logic 28-nm node CMOS technology. Keywords: Oxide Resistive memory (OxRRAM), High-k, MIM, CMOS, Characterization, Reliability, Modeling, Analog Design, Simulation
APA, Harvard, Vancouver, ISO, and other styles
30

Oliveira, Bruno Cruz de. "Simulação de reservatórios de petróleo em ambiente MPSoC." Universidade Federal do Rio Grande do Norte, 2009. http://repositorio.ufrn.br:8080/jspui/handle/123456789/17996.

Full text
Abstract:
The constant increase in the complexity of computer applications demands the development of more powerful hardware to support them. With processor operating frequencies reaching their limit, the most viable solution is the use of parallelism. Based on parallelism techniques and the progressive growth in the capacity to integrate transistors in a single chip arose the concept of MPSoCs (Multi-Processor Systems-on-Chip). MPSoCs will eventually become a cheaper and faster alternative to supercomputers and clusters, and applications developed for these high performance systems will migrate to computers equipped with MPSoCs containing dozens to hundreds of computation cores. In particular, applications in the area of oil and natural gas exploration are also characterized by the high processing capacity required and would benefit greatly from these high performance systems. This work evaluates a traditional and complex application of the oil and gas industry, known as reservoir simulation, developing a solution for computational systems integrated on a single chip with hundreds of functional units. For this, as the STORM (MPSoC Directory-Based Platform) platform already had a shared memory model, a new distributed memory model was developed. A message passing library following the MPI standard was also developed.
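Since the library follows the MPI standard, application code written for it should look like ordinary MPI. A minimal two-rank exchange using standard MPI-1 calls (that the STORM library covers exactly this subset is an assumption):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                    /* e.g., one simulator core ...  */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {            /* ... handing data to another   */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```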
APA, Harvard, Vancouver, ISO, and other styles
31

Lee, Jaekyu. "Shared resource management for efficient heterogeneous computing." Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50217.

Full text
Abstract:
The demand for heterogeneous computing, because of its performance and energy efficiency, has made on-chip heterogeneous chip multi-processors (HCMP) become the mainstream computing platform, as the recent trend shows in a wide spectrum of platforms from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have integrated more powerful GPUs and this trend will continue in upcoming processors. In this architecture, several system resources are shared between CPUs and GPUs. The sharing of system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention problems between cores. The resource sharing problem has existed since the homogeneous (CPU-only) chip multiprocessor (CMP) was introduced. However, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. In order to solve the resource sharing problem in HCMPs, we consider efficient shared resource management schemes, in particular tackling the problem in the shared last-level cache and the interconnection network. In this thesis, we propose four resource sharing mechanisms. First, we propose an efficient cache sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for the on-chip interconnection network is proposed to isolate inter-application interference. By partitioning virtual channels between CPUs and GPUs, we can prevent the interference problem while guaranteeing quality-of-service (QoS) for both cores. Third, we propose a dynamic frequency controlling mechanism to efficiently share system resources. When both cores are active, the degree of resource contention as well as the system throughput will be affected by the operating frequency of the CPUs and GPUs. The proposed mechanism tries to find optimal operating frequencies for both cores, which reduces the resource contention while improving system throughput. Finally, we propose a second cache sharing mechanism that exploits GPU-semantic information. The programming and execution models of GPUs are stricter and simpler than those of CPUs, and programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can exercise the cache energy-efficiently, and simpler but more efficient cache partitioning can be enabled for HCMPs.
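As a concrete picture of the kind of mechanism involved, a statically way-partitioned victim selection confines CPU and GPU requests to disjoint ways of a cache set (C sketch; the dissertation's policies are adaptive and exploit CPU/GPU characteristics, so this is only the baseline idea):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS     16
#define CPU_WAYS 10   /* illustrative split; real policies adapt it */

typedef struct { uint64_t tag; bool valid; uint32_t lru; } Line;

/* Pick an eviction victim only among the ways assigned to the
 * requesting core type, so CPU and GPU cannot thrash each other. */
int pick_victim(Line set[WAYS], bool is_gpu) {
    int lo = is_gpu ? CPU_WAYS : 0;
    int hi = is_gpu ? WAYS     : CPU_WAYS;
    int victim = lo;
    for (int w = lo; w < hi; w++) {
        if (!set[w].valid) return w;                  /* free way first */
        if (set[w].lru < set[victim].lru) victim = w; /* else oldest    */
    }
    return victim;
}
```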
APA, Harvard, Vancouver, ISO, and other styles
32

Löf, Henrik. "Iterative and Adaptive PDE Solvers for Shared Memory Architectures." Doctoral thesis, Uppsala universitet, Avdelningen för teknisk databehandling, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-7136.

Full text
Abstract:
Scientific computing is used frequently in an increasing number of disciplines to accelerate scientific discovery. Many such computing problems involve the numerical solution of partial differential equations (PDE). In this thesis we explore and develop methodology for high-performance implementations of PDE solvers for shared-memory multiprocessor architectures. We consider three realistic PDE settings: solution of the Maxwell equations in 3D using an unstructured grid and the method of conjugate gradients, solution of the Poisson equation in 3D using a geometric multigrid method, and solution of an advection equation in 2D using structured adaptive mesh refinement. We apply software optimization techniques to increase both parallel efficiency and the degree of data locality. In our evaluation we use several different shared-memory architectures ranging from symmetric multiprocessors and distributed shared-memory architectures to chip-multiprocessors. For distributed shared-memory systems we explore methods of data distribution to increase the amount of geographical locality. We evaluate automatic and transparent page migration based on runtime sampling, user-initiated page migration using a directive with an affinity-on-next-touch semantic, and algorithmic optimizations for page-placement policies. Our results show that page migration increases the amount of geographical locality and that the parallel overhead related to page migration can be amortized over the iterations needed to reach convergence. This is especially true for the affinity-on-next-touch methodology whereby page migration can be initiated at an early stage in the algorithms. We also develop and explore methodology for other forms of data locality and conclude that the effect on performance is significant and that this effect will increase for future shared-memory architectures. Our overall conclusion is that, if the involved locality issues are addressed, the shared-memory programming model provides an efficient and productive environment for solving many important PDE problems.
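On Linux, the user-initiated page migration evaluated in this line of work has a direct analogue in the libnuma move_pages(2) interface (the thesis's directive-based affinity-on-next-touch ran on other systems; this sketch only illustrates the operation):

```c
#define _GNU_SOURCE
#include <numaif.h>    /* move_pages(2); link with -lnuma */
#include <stdio.h>

/* Migrate one page of 'buf' to NUMA node 1, e.g. after the thread
 * that will touch it next has been bound to that node. */
int migrate_page(void *buf) {
    void *pages[1] = { buf };
    int   nodes[1] = { 1 };
    int   status[1];
    if (move_pages(0 /* self */, 1, pages, nodes, status,
                   MPOL_MF_MOVE) != 0) {
        perror("move_pages");
        return -1;
    }
    return status[0];  /* resulting node, or a negative errno value */
}
```

Amortizing such calls over the iterations of a solver is exactly the trade-off the thesis measures: the migration cost pays off when the subsequent accesses become local.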
APA, Harvard, Vancouver, ISO, and other styles
33

Faure, Etienne. "Communications matérielles / logicielles dans les systèmes sur puces multi-processeurs orientés télécommunications." Paris 6, 2007. http://www.theses.fr/2007PA066201.

Full text
Abstract:
This thesis presents a communication middleware in the context of embedded systems-on-chip. The application is described as a graph of communicating tasks. In this graph, each channel may have any number of producer tasks and consumer tasks. Communications are therefore represented explicitly, yielding a bipartite graph. In this graph, tasks may be implemented as software threads or as specialized coprocessors. We nevertheless wish to keep a uniform communication mechanism, whatever the hardware or software nature of the tasks. These constraints led us to specify shared-memory communication channels and a 5-step communication protocol to access them. This protocol is implemented as a library of software functions and as a hardware controller allowing a coprocessor to use these communication channels.
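The five protocol steps are not detailed in the abstract; one plausible decomposition for such a shared-memory channel (producer: reserve, write, commit; consumer: read, release) can be sketched as follows (C with pthreads as a software stand-in; the actual middleware also provides a hardware controller for coprocessors):

```c
#include <pthread.h>
#include <stdint.h>

#define DEPTH 8

/* A shared-memory channel usable by any number of producer and
 * consumer tasks. Initialize lock/conds with PTHREAD_MUTEX_INITIALIZER
 * and PTHREAD_COND_INITIALIZER, and head/tail/count to zero. */
typedef struct {
    uint32_t buf[DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} Channel;

void channel_send(Channel *c, uint32_t data) {
    pthread_mutex_lock(&c->lock);
    while (c->count == DEPTH)                 /* (1) reserve a slot */
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = data;                   /* (2) write the data */
    c->tail = (c->tail + 1) % DEPTH;
    c->count++;                               /* (3) commit         */
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

uint32_t channel_recv(Channel *c) {
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->lock);
    uint32_t data = c->buf[c->head];          /* (4) read the data  */
    c->head = (c->head + 1) % DEPTH;
    c->count--;                               /* (5) release slot   */
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return data;
}
```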
APA, Harvard, Vancouver, ISO, and other styles
34

Belhadj, Amor Hela. "Hiérarchie mémoire dans les systèmes intégrés multiprocesseurs construits autour de réseaux sur puce." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM049/document.

Full text
Abstract:
Multi/many-core parallel systems offering high computing power at low energy cost are nowadays a reality. However, exploiting the performance of these architectures depends on the efficiency of the system in managing data accesses. The aim of our work is to improve the efficiency of these accesses by exploiting the characteristics of the hardware architecture. In a first part, we propose a new cache hierarchy organization that aims at maximizing the use of the available storage space at each level. This solution, based on non-uniform cache access (NUCA) architectures, supports inter- and intra-level transfers in the hierarchy. It requires a cache coherence protocol that suits its specifications. Obviously, the transfer of data in the hierarchy is also a determinant of system performance. In a second part, we consider the specific communication needs of the protocol. We suggest the use of a virtualized network as an ad-hoc communication medium to manage coherence traffic at a lower cost. It links the caches of the same level to support intra-level transfers, which are a specificity of our protocol, in order to reduce the average access latency.
APA, Harvard, Vancouver, ISO, and other styles
35

Feki, Anis. "Conception d’une mémoire SRAM en tension sous le seuil pour des applications biomédicales et les nœuds de capteurs sans fils en technologies CMOS avancées." Thesis, Lyon, INSA, 2015. http://www.theses.fr/2015ISAL0018/document.

Full text
Abstract:
Emergence of large Systems-on-Chip introduces the challenge of power management. Among the various embedded blocks, static random access memories (SRAM) are among the largest contributors to power consumption. Scaling down the power supply is one way to act positively on power consumption. One aggressive target is to enable the operation of SRAMs at ultra-low voltage (ULV), i.e. as low as 300 mV (lower than the threshold voltage of standard CMOS transistors). The present work concerns the proposal of SRAM bitcells able to operate at ULV in advanced technology nodes (either CMOS bulk 28 nm or FDSOI 28 nm). A benchmark of published state-of-the-art architectures led to the proposal of two flavors of 10-transistor bitcells, solving the limitations due to leakage current and parasitic power consumption. Segmented read-ports are used along with the required synchronous peripheral circuitry, including an original replica assistance, a dedicated unbalanced sense amplifier for ULV operation, and dynamic forward back-biasing of the SOI boxes. Experimental test chips are provided in the previously mentioned technologies. A complete memory cut of 32 kbits (1024x32) has been designed with an embedded BIST block able to operate at ULV. After a general introduction, the manuscript presents the state of the art in chapter 2. The new 10T bitcells are presented in chapter 3. The sense amplifier along with the replica assistance is the core of chapter 4. The memory cut in FDSOI 28 nm is detailed in chapter 5.
The results of the PhD have been disseminated through 4 patent proposals, 2 papers in international conferences, a first paper accepted in an international journal, and a second paper submitted to an international journal.
APA, Harvard, Vancouver, ISO, and other styles
36

Cargnini, Luís Vitório. "Applications des technologies mémoires MRAM appliquées aux processeurs embarqués." Thesis, Montpellier 2, 2013. http://www.theses.fr/2013MON20091/document.

Full text
Abstract:
The semiconductor industry, with the advent of submicron manufacturing flows below 45 nm, began to face new challenges to keep evolving according to Moore's Law. For the widespread adoption of embedded systems, a major constraint has become the power consumption of the IC. In addition, memory technologies such as SRAM, the current standard embedded memory technology for the memory hierarchy, or FLASH for non-volatile storage, face extremely intricate constraints in yielding memory arrays at technological nodes below 45 nm. Importantly, non-volatile memories have so far not been adopted into the memory hierarchy, due to their density and, as with flash, the necessity of multi-voltage operation. This thesis works within these constraints and provides some answers. It presents methods, and results extracted from those methods, to corroborate our goal of delineating a roadmap for adopting a new memory technology that is non-volatile, low-power, low-leakage, SEU/MEU-resistant, scalable, and with performance similar to current SRAM, physically equivalent to SRAM or even better, with an area density between 4 and 8 times that of an SRAM cell, and without the necessity of a multi-voltage domain like FLASH. This memory is MRAM (magnetic RAM), according to the ITRS a candidate to replace SRAM in the near future. Instead of storing charge, MRAM stores the magnetic orientation provided by the spin-torque orientation of the free layer in the MTJ (Magnetic Tunnel Junction). Spin is a quantum state of matter that, in some metallic materials, can have its orientation (or its torque) switched by applying a current polarized in the direction of the desired field orientation. Once the magnetic field orientation is set, a sense amplifier and a current flow through the MTJ, the storage element of the MRAM cell, make it possible to measure the orientation from the resistance variation: the higher the resistance, the lower the passing current, and the sense amplifier identifies a logic zero; the lower the resistance, the sense amplifier senses a logic one. The information is therefore not a stored charge but a magnetic field orientation, which is why it is not affected by SEU or MEU caused by high-energy particles, nor changed by voltage variations trapping charges in a floating gate.
Regarding MRAM, this thesis addresses the following aspects of MRAM applied to the memory hierarchy: describing the current state of the art in MRAM design and its use in the memory hierarchy; providing an overview of a mechanism to mitigate the write latency of MRAM at the cache level (the composite memory bank principle); analyzing the power characteristics of a system with MRAM-based L1 and L2 caches, using a dedicated evaluation flow; proposing a methodology to infer system power consumption and performance; and finally analyzing a composite memory bank, with a simple description of how to generate a memory bank with some compromise in power but latency equivalent to SRAM, keeping similar performance.
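To make the sensing description above concrete, here is a toy behavioral model (our illustration with assumed resistance values, not the thesis's design) of a sense amplifier mapping MTJ resistance to a logic value:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy behavioral model of an MRAM read; illustrative values only.
 * Parallel (P) magnetization  -> low MTJ resistance  -> logic 1
 * Antiparallel (AP) state     -> high MTJ resistance -> logic 0
 * The reference resistance sits between the two states. */
#define R_PARALLEL_OHMS      2000.0   /* assumed low-resistance state  */
#define R_ANTIPARALLEL_OHMS  4000.0   /* assumed high-resistance state */
#define R_REFERENCE_OHMS     3000.0   /* sense-amplifier reference     */

/* The sense amplifier compares the cell current (inversely
 * proportional to resistance) against a reference current. */
static bool sense_amplifier_read(double r_mtj_ohms, double v_read)
{
    double i_cell = v_read / r_mtj_ohms;
    double i_ref  = v_read / R_REFERENCE_OHMS;
    return i_cell > i_ref;   /* more current = lower R = logic 1 */
}

int main(void)
{
    printf("AP cell reads %d\n", sense_amplifier_read(R_ANTIPARALLEL_OHMS, 0.1)); /* 0 */
    printf("P  cell reads %d\n", sense_amplifier_read(R_PARALLEL_OHMS, 0.1));     /* 1 */
    return 0;
}
```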
37

Kumar, T. S. Rajesh. "On-Chip Memory Architecture Exploration Of Embedded System On Chip." Thesis, 2008. http://hdl.handle.net/2005/752.

Full text
Abstract:
Today’s feature-rich multimedia products require embedded system solutions with complex System-on-Chip (SoC) designs to meet market expectations of high performance at low cost and low energy consumption. SoCs are complex designs with multiple embedded processors, memory subsystems, and application-specific peripherals. The memory architecture of embedded SoCs strongly influences the area, power and performance of the entire system; further, the memory subsystem constitutes a major part (typically up to 70%) of the silicon area of a current-day SoC. The on-chip memory organization of embedded processors varies widely from one SoC to another, depending on the application and the market segment for which the SoC is deployed. A wide variety of choices is available to embedded designers, from a simple on-chip SPRAM-based architecture to a more complex hybrid cache-SPRAM architecture. The performance of a memory architecture also depends on how the data variables of the application are placed in memory; for each memory architecture there are multiple data layouts that are efficient from a power and performance viewpoint. Further, the designer is interested in multiple optimal design points to address various market segments. Hence, memory architecture exploration for an embedded system involves evaluating a large design space on the order of 100,000 design points, each with several tens of thousands of data layouts. Due to its large impact on system performance parameters, the memory architecture is often hand-crafted by experienced designers exploring a very small subset of this design space; the vast memory design space prohibits any manual analysis. In this work, we propose an automated framework for on-chip memory architecture exploration. Our framework integrates memory architecture exploration and data layout to search the design space efficiently: while the memory exploration selects specific memory architectures, the data layout efficiently maps the given application onto the memory architecture under consideration and thus helps in evaluating it. The proposed memory exploration framework works at both the logical and the physical memory architecture level. Our work addresses on-chip memory architectures for DSP processors organized as multiple memory banks, where each bank can be a single- or dual-port bank and bank sizes can be non-uniform, and also addresses exploration of SPRAM- and cache-based on-chip memory architectures. Our proposed method is based on a multi-objective Genetic Algorithm and outputs several hundred Pareto-optimal design solutions that are interesting from area, power and performance viewpoints, within a few hours of running on a standard desktop configuration.
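As a side note on the multi-objective search used above, here is a minimal sketch (our illustration, not the author's framework) of the Pareto-dominance test such a Genetic Algorithm can use to retain non-dominated area/power/performance design points:

```c
#include <stdbool.h>
#include <stdio.h>

/* One candidate memory architecture, scored on the three objectives
 * named in the abstract: area, power and performance (all expressed
 * here as costs, so lower is better). */
typedef struct {
    double area;
    double power;
    double cycles;   /* performance expressed as execution cycles */
} design_point;

/* a dominates b if a is no worse in every objective and strictly
 * better in at least one; a multi-objective GA keeps the set of
 * points no other point dominates (the Pareto front). */
static bool dominates(const design_point *a, const design_point *b)
{
    bool no_worse = a->area   <= b->area  &&
                    a->power  <= b->power &&
                    a->cycles <= b->cycles;
    bool better   = a->area   <  b->area  ||
                    a->power  <  b->power ||
                    a->cycles <  b->cycles;
    return no_worse && better;
}

int main(void)
{
    design_point a = { .area = 1.0, .power = 2.0, .cycles = 100.0 };
    design_point b = { .area = 1.2, .power = 2.5, .cycles = 100.0 };
    printf("a dominates b: %d\n", dominates(&a, &b));   /* prints 1 */
    printf("b dominates a: %d\n", dominates(&b, &a));   /* prints 0 */
    return 0;
}
```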
38

Ho, Yui Luen, and Jeremy Yui Luen Ho. "Processor memory traffic characteristics for on-chip cache." Thesis, 1992. http://hdl.handle.net/1957/36922.

Full text
Abstract:
The motivation of this research is to study different designs for on-chip caches that improve processor performance while minimizing the degradation to system performance caused by an increase in processor memory traffic. As VLSI technology advances, we can have bigger and more complex on-chip caches that would not have been possible a few years ago. Results and performance issues derived for on-chip caches are basically similar to those for off-chip caches. In this study, we concentrate on single-level on-chip caches, though there are many interesting issues relating system performance, memory traffic and multi-level caches. Graduation date: 1992.
39

Jeong, Min Kyu. "Core-characteristic-aware off-chip memory management in a multicore system-on-chip." Thesis, 2012. http://hdl.handle.net/2152/ETD-UT-2012-12-6765.

Full text
Abstract:
Future processors will integrate an increasing number of cores because the scaling of single-thread performance is limited and because smaller cores are more power efficient. Off-chip memory bandwidth that is shared between those many cores, however, scales slower than the transistor (and core) count does. As a result, in many future systems, off-chip bandwidth will become the bottleneck of heavy demand from multiple cores. Therefore, optimally managing the limited off-chip bandwidth is critical to achieving high performance and efficiency in future systems. In this dissertation, I will develop techniques to optimize the shared use of limited off-chip memory bandwidth in chip-multiprocessors. I focus on issues that arise from the sharing and exploit the differences in memory access characteristics, such as locality, bandwidth requirement, and latency sensitivity, between the applications running in parallel and competing for the bandwidth. First, I investigate how the shared use of memory by many cores can result in reduced spatial locality in memory accesses. I propose a technique that partitions the internal memory banks between cores in order to isolate their access streams and eliminate locality interference. The technique compensates for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. For three different workload groups that consist of benchmarks with high spatial locality, low spatial locality, and mixes of the two, the average system efficiency improves by 10%, 7%, 9% for 2-rank systems, and 18%, 25%, 20% for 1-rank systems, respectively, over the baseline shared-bank system. Next, I improve the performance of a heterogeneous system-on-chip (SoC) in which cores have distinct memory access characteristics. I develop a deadline-aware shared memory bandwidth management scheme for SoCs that have both CPU and GPU cores. I show that statically prioritizing the CPU can severely constrict GPU performance, and propose to dynamically adapt the priority of CPU and GPU memory requests based on the progress of the GPU workload. The proposed dynamic bandwidth management scheme provides the target GPU performance while prioritizing CPU performance as much as possible, for any CPU-GPU workload combination with different complexities.
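To illustrate the dynamic bandwidth management idea (a sketch under assumed interfaces and names, not the dissertation's actual scheme), a memory controller could promote GPU requests only when the GPU falls behind its frame deadline:

```c
#include <stdio.h>

/* Illustrative policy: prioritize CPU requests as long as the GPU
 * is on pace to meet its frame deadline; boost GPU priority when
 * it falls behind. All names and thresholds are assumptions. */
typedef struct {
    double work_done;      /* fraction of GPU frame completed, 0..1      */
    double time_elapsed;   /* fraction of frame deadline consumed, 0..1  */
} gpu_progress;

typedef enum { PRIORITIZE_CPU, PRIORITIZE_GPU } mem_priority;

static mem_priority schedule_priority(const gpu_progress *g)
{
    /* If completed work tracks elapsed time, the GPU will meet its
     * deadline, so the CPU keeps priority; otherwise the GPU has
     * fallen behind and its requests are promoted. */
    return (g->work_done >= g->time_elapsed) ? PRIORITIZE_CPU
                                             : PRIORITIZE_GPU;
}

int main(void)
{
    gpu_progress on_pace = { .work_done = 0.6, .time_elapsed = 0.5 };
    gpu_progress behind  = { .work_done = 0.3, .time_elapsed = 0.5 };
    printf("on pace -> %d, behind -> %d\n",
           schedule_priority(&on_pace), schedule_priority(&behind));
    return 0;
}
```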
40

"Improving on-chip data cache using instruction register information." Chinese University of Hong Kong, 1996. http://library.cuhk.edu.hk/record=b5888778.

Full text
Abstract:
by Lau Siu Chung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 71-74). Contents: Introduction (hiding memory latency); Related Work (hardware-controlled and software-assisted cache prefetching); Data Prefetching (data reference patterns; embedded hints for next data references; the basic, enhanced and combined Instruction Opcode and Addressing Mode Prefetching (IAP) schemes); Performance Evaluation (trace-driven simulation, caching models, benchmarks and metrics; effects of cache size, block size and associativity; prefetch accuracy, partial hit delay, bus usage, zero-time prefetch); Conclusion and future work.
41

Nagda, Tanvi. "Memory interface architecture for network on chip based systems /." 2006. http://proquest.umi.com/pqdweb?did=1225157671&sid=2&Fmt=2&clientId=10361&RQT=309&VName=PQD.

Full text
42

Chang, Tian Sheuan, and 張添烜. "On-chip Memory Module Designs for Video Signal Processing." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/40745057865457257330.

Full text
Abstract:
Master's thesis, Institute of Electronics, National Chiao Tung University (ROC academic year 83). Video signal processing chips often require fast embedded memory to provide sufficient bandwidth to parallel data paths. By taking advantage of the features of video signal processing, two novel embedded memory designs are proposed and implemented. Concurrent line access provides multiport memory capability at single-port cell cost and access time, avoiding the cell stability problems and the larger area and access time overheads of multiport memories. For a 2Kx8 configuration, our dual-port design uses 62.24% of the area of a dual-port memory and is 7.6% larger than a single-port memory. Block access mode provides fast addressing by combining address decoders and generators; its access time is 26% faster than the conventional scheme for a 256-word x 32-bit configuration. Although the two fast modes impose a preferred access order, there is no loss of generality because video signal processing algorithms possess high data parallelism and little dependency. A flexible memory generator has been developed using TSMC 0.8um SPDM CMOS technology; the access time of a generated three-port RAM is 5.6ns for 16Kb size. A test chip has also been developed and fabricated through CIC.
43

Zhang, Tian Xuan, and 張添烜. "On-chip memory module designs for video signal processing." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/38191676307717593428.

Full text
44

Chang, Pei-Yao, and 張倍耀. "On Chip Memory Designs for Ultra-Low-Voltage SoC." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/gwmqet.

Full text
Abstract:
Ph.D. dissertation, Department of Electrical Engineering, National Chung Cheng University (ROC academic year 101). Ultra-low-voltage operation has become a popular alternative for SoC design in recent years. However, at a lower supply voltage, on-chip nanometer memories suffer from many problems. This dissertation gives a comprehensive description of several low-voltage, energy-efficient memory circuit techniques and proposes several low-voltage CMOS memory circuits, including Static Random Access Memory (SRAM), Register File (RF), and Read-Only Memory (ROM). First, a 4R/2W register file design for 2-issue microprocessors with ultra-wide dynamic voltage scaling is presented. A full-N separated read port is proposed to save area and improve performance for subthreshold operation. In addition, a reconfigurable write scheme is proposed that utilizes the write port left unused during single-issue execution to improve the write noise margin (WNM). A test chip has been designed and fabricated in a 65nm process, on which a minimum operating voltage of 148mV has been measured. Second, the design of embedded subthreshold SRAMs for a quality-scalable H.264 video decoder IP is presented. In addition to the conventional 7T SRAM bitcell, power-gating techniques and multi-output dynamic circuits are adopted to achieve a low VDDmin, a small area overhead, and higher operating speed. An 8Kb 90nm SRAM macro has been designed to verify the proposed techniques. Third, low-voltage, variation-tolerant SRAMs are implemented in two 65nm test chips: an 8Kb SRAM macro (0.25-1.0V) and a 512Kb frame buffer (0.45-1.0V). The main design techniques include a bitline leakage prediction scheme with a non-trimmed, non-strobed sense amplifier or a dynamic-trip-point sense amplifier to deal with process and runtime variations and data dependence. Finally, a 256Kb NOR-ROM realizes the proposed hierarchical bitline scheme with interleaved shielding and leakage-tracking NMOS keepers with half-rate bitline-tracking sensing; it has been fabricated in 90nm CMOS, with a measured VDDmin of 0.22V. Several experimental chips of the proposed circuits were designed and fabricated, and through extensive analyses, simulations, fabrication and measurement, all proposed memory techniques are verified. The memory circuits proposed in this dissertation allow lower supply voltages, greater energy efficiency and higher reliability in SoCs.
45

Yeh, Chun-Wen, and 葉俊文. "Processor-Programmable Memory BIST Framework for System-on-Chip." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/57805191702038533797.

Full text
Abstract:
Master's thesis, Department of Electrical Engineering, National Tsing Hua University (ROC academic year 89). Memory is now widely used in digital systems. The popular core-based System-on-Chip (SoC) environment typically contains memory cores of various kinds and sizes, so testing these embedded memories becomes ever more important and essential. In this thesis, we present a processor-programmable memory built-in self-test (BIST) framework for the SoC environment, with the goal of building a friendly and complete test framework to perform memory testing and verify our idea. Our SoC test environment includes a microprocessor, the proposed programmable BIST circuit, a self-defined bus, an arbiter, I/O, and memory. The framework can automatically execute the most popular March test algorithms from a brief description of the March algorithm and the register definitions of our BIST circuit. It simulates the March algorithm in a simple SoC environment after programming the BIST circuit via the on-chip microprocessor. Finally, it reports the test results; if any error occurs during testing, it also reports the erroneous responses and faulty addresses. Compared with processor-based memory BIST schemes that use an assembly-language program to perform testing and comparison of the memory outputs, the test time of our proposed BIST circuit is greatly reduced. Compared with a conventional dedicated BIST circuit, the area overhead is reduced and flexibility is higher. The proposed framework can perform various March test algorithms and verify the functionality of our BIST circuit.
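For readers unfamiliar with March tests, below is a minimal software model of the well-known March C- algorithm, the kind of test such a BIST executes (our illustration of the general technique, not this thesis's hardware):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 256
static uint8_t mem[MEM_WORDS];   /* memory under test (modeled here) */

static bool check(size_t addr, uint8_t expect)
{
    if (mem[addr] != expect) {
        printf("fault at %zu: read %u, expected %u\n",
               addr, mem[addr], expect);
        return false;
    }
    return true;
}

/* March C-: {up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0);
 * up(r0)}; it detects stuck-at, transition and coupling faults. */
static bool march_c_minus(void)
{
    bool ok = true;
    size_t i;
    for (i = 0; i < MEM_WORDS; i++) mem[i] = 0;                        /* up(w0)      */
    for (i = 0; i < MEM_WORDS; i++) { ok &= check(i, 0); mem[i] = 1; } /* up(r0,w1)   */
    for (i = 0; i < MEM_WORDS; i++) { ok &= check(i, 1); mem[i] = 0; } /* up(r1,w0)   */
    for (i = MEM_WORDS; i-- > 0; )  { ok &= check(i, 0); mem[i] = 1; } /* down(r0,w1) */
    for (i = MEM_WORDS; i-- > 0; )  { ok &= check(i, 1); mem[i] = 0; } /* down(r1,w0) */
    for (i = 0; i < MEM_WORDS; i++) ok &= check(i, 0);                 /* up(r0)      */
    return ok;
}

int main(void)
{
    printf("March C- %s\n", march_c_minus() ? "passed" : "failed");
    return 0;
}
```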
46

Yen, Yu-Kai, and 顏于凱. "On-Chip Bus and Memory Architecture Exploration for Embedded SoC." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/61733150912988246048.

Full text
47

Liao, Kuang-Yao, and 廖光耀. "Platform Design on Intelligent Serial type of Flash Memory Chip." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/14944891041374715054.

Full text
Abstract:
Master's thesis, Department of Electrical Engineering, Tamkang University (ROC academic year 93). This thesis designs a new development platform for verifying an intelligent serial flash memory IC, and uses this platform to test the access method of the IC's new memory control structure and its hardware interface. With the progress of technology and manufacturing, electronic devices such as PDAs, digital still cameras, mobile phones and notebook computers are becoming more and more powerful, and users demand ever larger data storage. This has brought out many kinds of memory cards, such as CF and SD cards. However, each of these requires its own interface for data access, and many electronic products do not integrate all of these interfaces, which is inconvenient for users. To solve this problem, the intelligent serial flash memory IC uses a common UART port as the main interface for data transmission and command control, providing a widely applicable interface. An additional parallel bus interface achieves high-speed data transmission, and the built-in data storage and file management functions reduce the system resources needed for data access. With this storage medium and interface, high-density memory (cards) can be managed and used without an additional special-purpose access interface. This thesis verifies the specification and function of the intelligent serial flash memory IC using the developed platform, which is built on an SoC framework with an Altera FPGA, an 8051 microcontroller IP core and flash memory. Verification on this platform confirmed that it is workable to test the functions of the intelligent serial flash memory IC.
48

"Unified on-chip multi-level cache management scheme using processor opcodes and addressing modes." Chinese University of Hong Kong, 1996. http://library.cuhk.edu.hk/record=b5895702.

Full text
Abstract:
by Stephen Siu-ming Wong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 164-170). Contents: Introduction (cache memory, system and cache performance, cache prefetching); Related Work (memory hierarchy; cache configuration, replacement algorithms, write-back policies, miss types and prefetching; spatial vs. temporal locality; why not a large L1 cache; the trend toward on-chip L2 caches; hardware prefetch algorithms such as one-block look-ahead and Chen's RPT; software prefetch instructions; hybrid stride-CAM prefetching); Simulator (a cycle-by-cycle multi-level memory hierarchy simulator with non-blocking caches and prefetching support); Proposed Algorithms (SIRPA, the line concept, combined L1-L2 cache management, and SIRPA combined with default prefetch); Results (SPEC92int and SPEC92fp benchmarks; effects of cache size, block size and set associativity; hardware and software prefetch comparisons; L2 cache and main memory MCPI); Conclusion; Future Directions (prefetch buffer, dissimilar L1-L2 management, combined LRU/MRU replacement policy, N-loops look-ahead).
49

Yu-Shiang Chien and 錢郁翔. "Design of a Contention-aware Hybrid On-Chip Memory Management Mechanism." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/54741457670164006147.

Full text
Abstract:
Master's thesis, Department of Computer Science and Information Engineering, National Cheng Kung University (ROC academic year 101). Scratchpad memories (SPM) have been increasingly used in embedded systems due to their higher energy and area efficiency compared to ordinary caches. Hybrid on-chip memory architectures that combine an SPM with a mini cache have also been proposed. To reduce off-chip memory accesses in hybrid on-chip memory architectures, some related works put the most frequently accessed data into the SPM. However, these methods may be ineffective because the most frequently accessed data may not be the main cause of off-chip memory accesses. Instead, off-chip memory accesses are caused by cache misses, so reducing cache misses reduces off-chip memory accesses. Therefore, in this work, we propose using cache misses as the criterion for deciding whether a page should be moved to the SPM. We propose a page miss bookkeeping circuit that counts the cache misses occurring in each page; when the number of misses in a page exceeds a threshold, the page is moved to the SPM. Experimental results show that, compared to a cache-based on-chip memory architecture, our method reduces the energy-delay product (EDP) by 49%; compared to the work in [19], it reduces the EDP by 26%.
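A minimal software sketch of the miss-threshold migration policy described above (the names and threshold value are our assumptions, not the thesis's circuit):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PAGES       64
#define MISS_THRESHOLD  32   /* assumed migration threshold */

/* Software model of the page miss bookkeeping idea: one miss
 * counter per page; a page whose counter crosses the threshold
 * is migrated from the cached region into the SPM. */
static uint16_t miss_count[NUM_PAGES];
static bool     in_spm[NUM_PAGES];

static void on_cache_miss(unsigned page)
{
    if (in_spm[page]) return;            /* SPM accesses never miss */
    if (++miss_count[page] >= MISS_THRESHOLD) {
        in_spm[page] = true;             /* migrate page to SPM     */
        printf("page %u moved to SPM\n", page);
    }
}

int main(void)
{
    /* Page 3 misses often and gets migrated; page 7 does not. */
    for (int i = 0; i < 40; i++) on_cache_miss(3);
    for (int i = 0; i < 5;  i++) on_cache_miss(7);
    return 0;
}
```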
50

Wang, Shiang-Fei, and 王湘斐. "Memory-Centric On-Chip Interconnection Network for Wireless Video Entertainment Systems." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/71676070539312188684.

Full text
Abstract:
Master's thesis, Institute of Electronics, National Chiao Tung University (ROC academic year 98). A memory-centric on-chip interconnection network (OCIN) with an efficient network interface is realized in this thesis, and a borrowing mechanism is proposed to reduce head-of-line data blocking. For a wireless video entertainment system, the OCIN provides the microarchitecture and the building blocks, including network interfaces (NIs), routers and link wires. By considering the borrowed memory blocks and a distributed memory management unit (d-MMU), the size of the output queue in the NI can be scheduled dynamically. Based on cycle-driven simulation in SystemC, the proposed efficient NI achieves a 1.15x performance improvement over the conventional NI, and blocking is reduced by 2%-4%. For an on-demand memory system, data blocking can be reduced efficiently by adjusting the buffer size and the borrowed memory blocks: under a 70% receiver blocking rate and 16 words of borrowed memory blocks, the data blocking reduction rate reaches 25%. With the proposed memory-centric OCIN, we improve the data communication environment for wireless video entertainment systems.
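To sketch the borrowing idea (a toy model with assumed sizes and policies, not the thesis's d-MMU implementation), an NI output queue could borrow blocks from a shared pool instead of blocking the sender:

```c
#include <stdbool.h>
#include <stdio.h>

#define BASE_CAPACITY   8   /* assumed fixed queue size (words)     */
#define BORROW_POOL    16   /* assumed shared pool of borrowable words */

static int pool_free = BORROW_POOL;

typedef struct {
    int occupancy;   /* words currently queued          */
    int borrowed;    /* words borrowed from the pool    */
} ni_output_queue;

/* Try to enqueue one word: use base capacity first, then borrow
 * from the shared pool rather than stalling (head-of-line blocking). */
static bool enqueue(ni_output_queue *q)
{
    if (q->occupancy < BASE_CAPACITY + q->borrowed) {
        q->occupancy++;
        return true;
    }
    if (pool_free > 0) {        /* borrow one more block */
        pool_free--;
        q->borrowed++;
        q->occupancy++;
        return true;
    }
    return false;               /* pool exhausted: blocking occurs */
}

static void dequeue(ni_output_queue *q)
{
    if (q->occupancy == 0) return;
    q->occupancy--;
    /* Eagerly return spare borrowed capacity to the shared pool. */
    if (q->borrowed > 0 && q->occupancy < BASE_CAPACITY + q->borrowed) {
        q->borrowed--;
        pool_free++;
    }
}

int main(void)
{
    ni_output_queue q = {0};
    for (int i = 0; i < 12; i++) enqueue(&q);   /* borrows 4 blocks */
    printf("occupancy=%d borrowed=%d pool=%d\n",
           q.occupancy, q.borrowed, pool_free);
    dequeue(&q);
    printf("after dequeue: borrowed=%d pool=%d\n", q.borrowed, pool_free);
    return 0;
}
```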