
Journal articles on the topic 'Computer architecture. Cache memory'


Consult the top 50 journal articles for your research on the topic 'Computer architecture. Cache memory.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Drach, N., A. Gefflaut, P. Joubert, and A. Seznec. "About Cache Associativity in Low-Cost Shared Memory Multi-Microprocessors." Parallel Processing Letters 5, no. 3 (September 1995): 475–87. http://dx.doi.org/10.1142/s0129626495000436.

Abstract:
Sizes of on-chip caches on current commercial microprocessors range from 16 Kbytes to 36 Kbytes. These microprocessors can be directly used in the design of a low-cost single-bus shared memory multiprocessor without using any second-level cache. In this paper, we explore the viability of such a multi-microprocessor. Simulation results clearly establish that the performance of such a system will be quite poor if the on-chip caches are direct-mapped. On the other hand, when the on-chip caches are partially associative, the achieved level of performance is quite promising. In particular, two recently proposed innovative cache structures, the skewed-associative cache organization and the semi-unified cache organization, are shown to perform well.
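The skewed-associative organization referenced above indexes each cache way with a different hash of the block address, so two blocks that collide in one way are unlikely to collide in the others. The C sketch below is a minimal illustration of that indexing idea; the XOR-based hash and the set count are assumptions for the example, not the functions from the paper.

```c
#include <stdint.h>

#define SET_BITS 7
#define SETS (1u << SET_BITS)

/* Way 0 uses conventional modulo indexing. */
static uint32_t index_way0(uint32_t block_addr) {
    return block_addr & (SETS - 1);
}

/* Way 1 folds higher address bits into the index, so the set of blocks
   that conflict here differs from the set that conflicts in way 0. */
static uint32_t index_way1(uint32_t block_addr) {
    return (block_addr ^ (block_addr >> SET_BITS)) & (SETS - 1);
}
```

On a lookup, both indices are probed in parallel; a hit in either way avoids the conflict miss that a direct-mapped cache would take on the same address pair.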
2

Alves, Marco A. Z., Henrique C. Freitas, and Philippe O. A. Navaux. "High Latency and Contention on Shared L2-Cache for Many-Core Architectures." Parallel Processing Letters 21, no. 1 (March 2011): 85–106. http://dx.doi.org/10.1142/s0129626411000096.

Abstract:
Several studies point out the benefits of a shared L2 cache, but other properties of shared caches must also be considered to reach a thorough understanding of all chip multiprocessor (CMP) bottlenecks. Our paper evaluates and explains shared cache bottlenecks, which are very important considering the rise of many-core processors. The results of our simulations with 32 cores show low performance when the L2 cache is shared between 2 or 4 cores. In these two cases, increased L2 cache latency and contention are the main causes of the increase in execution time.
3

Struharik, Rastislav, and Vuk Vranjković. "Striping input feature map cache for reducing off-chip memory traffic in CNN accelerators." Telfor Journal 12, no. 2 (2020): 116–21. http://dx.doi.org/10.5937/telfor2002116s.

Abstract:
Data movement between Convolutional Neural Network (CNN) accelerators and off-chip memory is a critical factor in overall power consumption. Minimizing power consumption is particularly important for low-power embedded applications. Specific CNN compute patterns offer a possibility of significant data reuse, leading to the idea of using specialized on-chip cache memories which enable a significant improvement in power consumption. However, due to the unique caching pattern present within CNNs, standard cache memories would not be efficient. In this paper, a novel on-chip cache memory architecture based on the idea of input feature map striping is proposed, which requires significantly less on-chip memory resources compared to previously proposed solutions. Experimental results show that the proposed cache architecture can reduce the on-chip memory size by a factor of 16 or more, while increasing power consumption by no more than 15%, compared to some of the previously proposed solutions.
4

Charrier, Dominic E., Benjamin Hazelwood, Ekaterina Tutlyaeva, Michael Bader, Michael Dumbser, Andrey Kudryavtsev, Alexander Moskovsky, and Tobias Weinzierl. "Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver." International Journal of High Performance Computing Applications 33, no. 5 (April 15, 2019): 973–86. http://dx.doi.org/10.1177/1094342019842645.

Abstract:
We study the performance behaviour of a seismic simulation using the ExaHyPE engine with a specific focus on memory characteristics and energy needs. ExaHyPE combines dynamically adaptive mesh refinement (AMR) with ADER-DG. It is parallelized using tasks, and it is cache efficient. AMR plus ADER-DG yields a task graph which is highly dynamic in nature and comprises both arithmetically expensive tasks and tasks which challenge the memory's latency. The expensive tasks and thus the whole code benefit from AVX vectorization, although we suffer from memory access bursts. A frequency reduction of the chip improves the code's energy-to-solution, yet it does not mitigate burst effects. The bursts' latency penalty becomes worse once we add Intel Optane technology, increase the core count significantly, or make individual, computationally heavy tasks fall out of close caches. Thread overbooking to hide away these latency penalties becomes counterproductive with noninclusive caches, as it destroys the code's cache and vectorization character. In cases where memory-intense and computationally expensive tasks overlap, ExaHyPE's cache-oblivious implementation can nevertheless exploit deep, noninclusive, heterogeneous memory effectively, as main memory misses arise infrequently and slow down only a few cores. We thus propose that upcoming supercomputing simulation codes with dynamic, inhomogeneous task graphs be actively supported by thread runtimes in intermixing tasks of different compute character, and that future hardware actively allow codes to downclock the cores running particular task types.
5

Kaplow, Wesley K., and Boleslaw K. Szymanski. "Compile-Time Cache Performance Prediction and Its Application to Tiling." Parallel Processing Letters 7, no. 4 (December 1997): 393–407. http://dx.doi.org/10.1142/s0129626497000395.

Abstract:
Tiling has been used by parallelizing compilers to define fine-grain parallel tasks and to optimize cache performance. In this paper we present a novel compile-time technique, called miss-driven cache simulation, for determining loop tile sizes that achieve the highest cache hit-rate. The widening disparity between a processor's peak instruction rate and main memory access time in modern computer systems makes this kind of optimization increasingly important for overall program efficiency. Our simulation technique generates only those references of a loop nest that may generate a cache memory miss and processes them on an architecturally accurate cache model at compile-time. Processing only a small portion of the memory reference trace of a program yields simulation speeds in the millions of memory references per second on workstations, with statistics of misses per reference and inter-reference interference counts gathered at the same time. These simulation speeds and statistics allow for the accurate analysis of the impact of cache optimizations at compile-time. We discuss the results of applying this method to guide loop tiling for such commonly used computational kernels as matrix multiplication and Jacobi iteration for various cache parameters.
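Loop tiling of the kind the paper's miss-driven simulation is meant to guide restructures a loop nest so each sub-block of the data is fully reused while it still resides in cache. A minimal C sketch for matrix multiplication follows; the tile size T is illustrative, standing in for the value a compile-time analysis such as the one above would select.

```c
/* Tiled matrix multiply C += A * B on n x n row-major matrices.
   Each T x T block of A, B, and C is reused while cache-resident. */
#define T 32  /* illustrative tile size; chosen per cache in practice */

void matmul_tiled(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int k = kk; k < kk + T && k < n; k++) {
                        double a = A[i * n + k];     /* held in a register */
                        for (int j = jj; j < jj + T && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```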
6

Wyland, David C. "Cache tag RAM chips simplify cache memory design." Microprocessors and Microsystems 14, no. 1 (January 1990): 47–57. http://dx.doi.org/10.1016/0141-9331(90)90013-l.

7

Gan, Xin Biao, Li Shen, Quan Yuan Tan, Cong Liu, and Zhi Ying Wang. "Performance Evaluation and Optimization on GPU." Advanced Materials Research 219-220 (March 2011): 1445–49. http://dx.doi.org/10.4028/www.scientific.net/amr.219-220.1445.

Abstract:
With hundreds of cores, GPUs provide higher peak performance than their CPU counterparts. However, it is a big challenge to take full advantage of their computing power. In order to understand the performance bottlenecks of applications on many-core GPUs and then optimize parallel programs on GPU architectures, we propose a performance evaluation model based on the memory wall and then classify applications into AbM (Application bound-in Memory) and AbC (Application bound-in Computing). Furthermore, we optimize kernels characterized by low memory bandwidth, including matrix multiplication and FFT (Fast Fourier Transform), by employing the texture cache on an NVIDIA GTX280 using CUDA (Compute Unified Device Architecture). Experimental results show that the texture cache is helpful for AbM applications with better data locality, so it is critical to utilize the GPU memory hierarchy efficiently for performance improvement.
8

Dalui, Mamata, and Biplab K. Sikdar. "A Cache System Design for CMPs with Built-In Coherence Verification." VLSI Design 2016 (October 30, 2016): 1–16. http://dx.doi.org/10.1155/2016/8093614.

Abstract:
This work reports an effective design of a cache system for Chip Multiprocessors (CMPs). It introduces built-in logic for verifying cache coherence in CMPs realizing a directory-based protocol. It is developed around the cellular automata (CA) machine, invented by John von Neumann in the 1950s. A special class of CA, referred to as single-length-cycle 2-attractor cellular automata (TACA), is employed to detect inconsistencies in the cache line states of the processors' private caches. The TACA module captures the coherence status of the CMP's cache system and memorizes any inconsistent recording of cache line states during a processor's reference to a memory block. Theory has been developed to enable a TACA to analyse cache state updates and then settle to an attractor state, indicating a quick decision on a faulty recording of cache line status. Segmenting the CMP's processor pool ensures better efficiency in determining inconsistencies by reducing the number of computation steps in the verification logic. The hardware requirement of the verification logic shows that the overhead of the proposed coherence verification module is much lower than that of conventional verification units and is insignificant with respect to the cost of the CMP's cache system.
9

Mohammad, Khader, Ahsan Kabeer, and Tarek Taha. "On-Chip Power Minimization Using Serialization-Widening with Frequent Value Encoding." VLSI Design 2014 (May 6, 2014): 1–14. http://dx.doi.org/10.1155/2014/801241.

Abstract:
In chip-multiprocessor (CMP) architectures, the L2 cache is shared by the L1 cache of each processor core, resulting in a high volume of diverse data transfer through the L1-L2 cache bus. High-performance CMP and SoC systems also have a significant amount of data transfer between the on-chip L2 cache and the L3 cache of off-chip memory through the power-expensive off-chip memory bus. This paper addresses the problem of the high power consumption of on-chip data buses, exploring a framework for minimizing memory data bus power consumption. A comprehensive analysis of the existing bus power minimization approaches is provided based on performance, power, and area overhead considerations. A novel approach for reducing the power consumption of the on-chip bus is introduced. In particular, serialization-widening (SW) of the data bus with frequent value encoding (FVE), called the SWE approach, is proposed as the best power-saving approach for the on-chip cache data bus. The experimental results show that the SWE approach with FVE can achieve approximately 54% power savings over the conventional bus for multicore applications using a 64-bit wide data bus in 45 nm technology.
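Frequent value encoding, one half of the SWE scheme above, exploits the observation that a small set of values accounts for a large share of bus traffic: a value found in a small table is sent as a short index instead of a full word. The C sketch below is a simplified illustration; the table size and contents are assumptions, and a real FVE bus keeps the encoder and decoder tables synchronized and updates them dynamically.

```c
#include <stdbool.h>
#include <stdint.h>

#define FV_SLOTS 8
/* Toy table of "frequent values"; unlisted slots default to 0 here. */
static uint32_t fv_table[FV_SLOTS] = { 0, 1, 0xFFFFFFFFu };

typedef struct { bool hit; uint8_t index; uint32_t raw; } BusWord;

/* Encoder side: a table hit costs 1 flag bit + 3 index bits on the wire;
   a miss costs the flag bit plus the full 32-bit word. */
BusWord fve_encode(uint32_t word) {
    for (uint8_t i = 0; i < FV_SLOTS; i++)
        if (fv_table[i] == word)
            return (BusWord){ .hit = true, .index = i };
    return (BusWord){ .hit = false, .raw = word };
}

/* Decoder side: recover the word from the index or take it verbatim. */
uint32_t fve_decode(BusWord w) {
    return w.hit ? fv_table[w.index] : w.raw;
}
```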
10

Chong, Frederic T., and Anant Agarwal. "Shared Memory versus Message Passing for Iterative Solution of Sparse, Irregular Problems." Parallel Processing Letters 9, no. 1 (March 1999): 159–70. http://dx.doi.org/10.1142/s0129626499000177.

Abstract:
The benefits of hardware support for shared memory versus those for message passing are difficult to evaluate without an in-depth study of real applications on a common platform. We evaluate the communication mechanisms of the MIT Alewife machine, a multiprocessor which provides integrated cache-coherent shared memory, message passing, and DMA. We perform this evaluation with "best-effort" implementations which solve several sparse, irregular benchmark problems with a preconditioned conjugate gradient sparse matrix solver (ICCG). We find that machines with fast global memory operations do not need message passing or bulk transfer to support our irregular problems. This is primarily due to three reasons. First, a 5-to-1 ratio between global and local cache misses makes memory copies in bulk communication expensive relative to communication via shared memory. Second, although message passing has synchronization semantics superior to shared memory for data-driven computation, efficient shared memory can overcome this handicap by using global read-modify-writes to change from the traditional owner-computes model to a producer-computes model. Third, bulk transfers can result in high processor idle times in irregular applications.
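The producer-computes idea above can be pictured with a global read-modify-write: instead of the owner of a datum gathering contributions, each producer adds its partial result directly into the shared location. The sketch below expresses this with C11 atomics as a stand-in; Alewife's actual hardware primitives differ, and the CAS loop is needed because C11 has no atomic fetch-add for doubles.

```c
#include <stdatomic.h>

/* Producer-computes: the producer of a contribution commits it directly
   into the shared cell with a read-modify-write, rather than leaving the
   owner to gather contributions later. */
void accumulate(_Atomic double *cell, double contribution) {
    double old = atomic_load(cell);
    while (!atomic_compare_exchange_weak(cell, &old, old + contribution))
        ; /* on failure, 'old' is refreshed with the current value; retry */
}
```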
11

Iyer, Ravi, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin, Lisa Hsu, and Steve Reinhardt. "QoS policies and architecture for cache/memory in CMP platforms." ACM SIGMETRICS Performance Evaluation Review 35, no. 1 (June 12, 2007): 25–36. http://dx.doi.org/10.1145/1269899.1254886.

12

Picano, Silvio, Eugene D. Brooks III, and Joseph E. Hoag. "Assessing Programming Costs of Explicit Memory Localization on a Large Scale Shared Memory Multiprocessor." Scientific Programming 1, no. 1 (1992): 67–78. http://dx.doi.org/10.1155/1992/923069.

Abstract:
We present detailed experimental work involving a commercially available large scale shared memory multiple instruction stream-multiple data stream (MIMD) parallel computer having a software-controlled cache coherence mechanism. To make effective use of such an architecture, the programmer is responsible for designing the program's structure to match the underlying multiprocessor's capabilities. We describe the techniques used to exploit our multiprocessor (the BBN TC2000) on a network simulation program, showing the resulting performance gains and the associated programming costs. We show that an efficient implementation relies heavily on the user's ability to explicitly manage the memory system.
13

Rotithor, H. G. "On the effective use of a cache memory simulator in a computer architecture course." IEEE Transactions on Education 38, no. 4 (1995): 357–60. http://dx.doi.org/10.1109/13.473156.

14

Keen, D., M. Oskin, J. Hensley, and F. T. Chong. "Cache coherence in intelligent memory systems." IEEE Transactions on Computers 52, no. 7 (July 2003): 960–66. http://dx.doi.org/10.1109/tc.2003.1214343.

15

Hondroulis, Antonis, Costas Harizakis, and Peter Triantafillou. "Optimal Cache Memory Exploitation for Continuous Media: To Cache or to Prefetch?" Multimedia Tools and Applications 23, no. 3 (August 2004): 203–20. http://dx.doi.org/10.1023/b:mtap.0000031757.02159.ac.

16

Пуйденко, Вадим Олексійович, and Вячеслав Сергійович Харченко. "МІНІМІЗАЦІЯ ЛОГІЧНОЇ СХЕМИ ДЛЯ РЕАЛІЗАЦІЇ PSEUDO LRU ШЛЯХОМ МІЖТИПОВОГО ПЕРЕХОДУ У ТРИГЕРНИХ СТРУКТУРАХ" [Minimization of the logic circuit for pseudo-LRU implementation through inter-type transition in flip-flop structures]. Radioelectronic and Computer Systems, no. 2 (April 26, 2020): 33–47. http://dx.doi.org/10.32620/reks.2020.2.03.

Abstract:
The principle of program control means that the processor core turns to the main memory of the computer for operands or instructions. Operands are stored in data segments and instructions in code segments of the main memory. The operating system uses both paged and segmented memory organization, with the page organization always mapped onto the segment organization. Owing to the cached packet cycles of the processor core, copies of main memory pages are stored in the internal associative cache memory. The associative cache memory consists of three units: a data unit, a tag unit, and an LRU unit. The data unit stores operands or instructions, the tag unit holds fragments of address information, and the LRU unit implements the line replacement policy. On a miss, the LRU logic decides which line in the data unit to replace. The pseudo-LRU algorithm is a simple replacement policy that compares well with the known alternatives. Two options for minimizing the replacement-policy hardware of the pseudo-LRU algorithm in a q-way set-associative cache are implemented; in both, a transition from synchronous D flip-flops to synchronous JK flip-flops is carried out. The first option exploits the order in which the pseudo-LRU algorithm updates the LRU bits, which allows the combinational logic for updating the LRU-unit bits to be removed. The second option exploits the order in which the way index q changes as a consequence of the pseudo-LRU updates, which additionally reduces the number of memory elements. Both options improve the performance and reliability of the LRU unit.
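For reference, the tree-based pseudo-LRU policy discussed above can be stated compactly for a 4-way set: three bits form a binary tree whose pointers are flipped away from each accessed way, and a victim is found by following the pointers. The C sketch below is the textbook formulation under that common bit convention, not the paper's minimized flip-flop realization.

```c
/* Tree pseudo-LRU state for one 4-way set, held in 3 bits:
   bit 0: which half is LRU (0 = ways 0-1, 1 = ways 2-3)
   bit 1: LRU way within ways 0-1
   bit 2: LRU way within ways 2-3 */

/* On a miss, follow the tree bits to the line to replace. */
unsigned plru_victim(unsigned bits) {
    if ((bits & 1u) == 0)
        return (bits >> 1) & 1u;        /* LRU way in the left pair  */
    return 2u + ((bits >> 2) & 1u);     /* LRU way in the right pair */
}

/* On a hit or fill, point every tree bit away from the accessed way. */
void plru_touch(unsigned *bits, unsigned way) {
    if (way < 2) {
        *bits |= 1u;                               /* right half becomes LRU */
        *bits = (*bits & ~2u) | ((way ^ 1u) << 1); /* sibling is pair's LRU  */
    } else {
        *bits &= ~1u;                              /* left half becomes LRU  */
        *bits = (*bits & ~4u) | (((way - 2u) ^ 1u) << 2);
    }
}
```

The policy needs only 3 bits per set instead of the per-way counters true LRU requires, which is why it is attractive for hardware minimization.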
17

Pahikkala, Tapio, Antti Airola, Thomas Canhao Xu, Pasi Liljeberg, Hannu Tenhunen, and Tapio Salakoski. "Parallelized Online Regularized Least-Squares for Adaptive Embedded Systems." International Journal of Embedded and Real-Time Communication Systems 3, no. 2 (April 2012): 73–91. http://dx.doi.org/10.4018/jertcs.2012040104.

Abstract:
The authors introduce a machine learning approach based on a parallel online regularized least-squares learning algorithm for parallel embedded hardware platforms. The system is suitable for use in real-time adaptive systems. Firstly, the system can learn in an online fashion, a property required in real-life applications of embedded machine learning systems. Secondly, to guarantee real-time response in embedded multi-core computer architectures, the learning system is parallelized and able to operate with a limited amount of computational and memory resources. Thirdly, the system can predict several labels simultaneously. The authors evaluate the performance of the algorithm from three different perspectives. The prediction performance is evaluated on a hand-written digit recognition task. The computational speed is measured from 1 thread to 4 threads on a quad-core platform. As a promising unconventional multi-core architecture, a Network-on-Chip platform is studied for the algorithm. The authors construct a NoC consisting of a 4x4 mesh. The machine learning algorithm is implemented in this platform with up to 16 threads. It is shown that the memory consumption and cache efficiency can be considerably improved by optimizing the cache behavior of the system. The authors' results provide a guideline for designing future embedded multi-core machine learning devices.
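As background for the abstract above, regularized least-squares in its standard batch form solves the problem below; the paper's contribution is an online, parallel, multi-label variant, so this closed form is the baseline it builds on rather than the algorithm itself.

```latex
\min_{w}\ \lVert Xw - y \rVert_2^2 + \lambda \lVert w \rVert_2^2
\quad\Longrightarrow\quad
w = \left(X^{\top}X + \lambda I\right)^{-1} X^{\top} y
```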
18

German, Steven M. "Formal Design of Cache Memory Protocols in IBM." Formal Methods in System Design 22, no. 2 (March 2003): 133–41. http://dx.doi.org/10.1023/a:1022921522163.

19

Xu, Hongjie, Jun Shiomi, Tohru Ishihara, and Hidetoshi Onodera. "On-Chip Cache Architecture Exploiting Hybrid Memory Structures for Near-Threshold Computing." IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E102.A, no. 12 (December 1, 2019): 1741–50. http://dx.doi.org/10.1587/transfun.e102.a.1741.

20

Meixner, A., and D. J. Sorin. "Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures." IEEE Transactions on Dependable and Secure Computing 6, no. 1 (January 2009): 18–31. http://dx.doi.org/10.1109/tdsc.2007.70243.

21

Ramachandran, Umakishore, and Joonwon Lee. "Cache-Based Synchronization in Shared Memory Multiprocessors." Journal of Parallel and Distributed Computing 32, no. 1 (January 1996): 11–27. http://dx.doi.org/10.1006/jpdc.1996.0002.

22

Al-Kharusi, Ibrahim, and David W. Walker. "Locality properties of 3D data orderings with application to parallel molecular dynamics simulations." International Journal of High Performance Computing Applications 33, no. 5 (May 19, 2019): 998–1018. http://dx.doi.org/10.1177/1094342019846282.

Abstract:
Application performance on graphical processing units (GPUs), in terms of execution speed and memory usage, depends on the efficient use of hierarchical memory. It is expected that enhancing data locality in molecular dynamic simulations will lower the cost of data movement across the GPU memory hierarchy. The work presented in this article analyses the spatial data locality and data reuse characteristics for row-major, Hilbert and Morton orderings and the impact these have on the performance of molecular dynamics simulations. A simple cache model is presented, and this is found to give results that are consistent with the timing results for the particle force computation obtained on NVidia GeForce GTX960 and Tesla P100 GPUs. Further analysis of the observed memory use, in terms of cache hits and the number of memory transactions, provides a more detailed explanation of execution behaviour for the different orderings. To the best of our knowledge, this is the first study to investigate memory analysis and data locality issues for molecular dynamics simulations of Lennard-Jones fluids on NVidia’s Maxwell and Tesla architectures.
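Of the orderings compared above, Morton (Z-order) is the simplest to state: the index is formed by interleaving the bits of the coordinates, so particles close in space tend to be close in memory. A standard C routine for a 3D Morton code on 10-bit coordinates follows; the magic-mask constants are the well-known bit-spreading idiom, and Hilbert ordering, which needs a more involved state machine, is omitted.

```c
#include <stdint.h>

/* Spread the low 10 bits of x so two zero bits separate each data bit. */
static uint32_t spread3(uint32_t x) {
    x &= 0x3FFu;
    x = (x | (x << 16)) & 0x030000FFu;
    x = (x | (x <<  8)) & 0x0300F00Fu;
    x = (x | (x <<  4)) & 0x030C30C3u;
    x = (x | (x <<  2)) & 0x09249249u;
    return x;
}

/* Interleave x, y, z into a single Z-order (Morton) index. */
uint32_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}
```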
23

Hać, Anna. "Design algorithms for asynchronous operations in cache memory." ACM SIGMETRICS Performance Evaluation Review 16, no. 2-4 (February 1989): 21. http://dx.doi.org/10.1145/1041911.1041914.

24

Kyriacou, Costas, Paraskevas Evripidou, and Pedro Trancoso. "CacheFlow: Cache Optimizations for Data Driven Multithreading." Parallel Processing Letters 16, no. 2 (June 2006): 229–44. http://dx.doi.org/10.1142/s0129626406002599.

Abstract:
Data-Driven Multithreading is a non-blocking multithreading model of execution that provides effective latency tolerance by allowing the computation processor to do useful work while a long-latency event is in progress. With the Data-Driven Multithreading model, a thread is scheduled for execution only if all of its inputs have been produced and placed in the processor's local memory. Data-driven sequencing leads to irregular memory access patterns that could negatively affect cache performance. Nevertheless, it enables the implementation of short-term optimal cache management policies. This paper presents the implementation of CacheFlow, an optimized cache management policy which eliminates the side effects of the loss of locality caused by data-driven sequencing and further reduces cache misses. CacheFlow employs thread-based prefetching to preload the data blocks of threads deemed executable. Simulation results for nine scientific applications on a 32-node Data-Driven Multithreaded machine show an average speedup improvement from 19.8 to 22.6. Two techniques to further improve the performance of CacheFlow, conflict avoidance and thread reordering, are proposed and tested. Simulation experiments have shown speedup improvements of 24% and 32%, respectively. The average speedup for all applications on a 32-node machine with both optimizations is 26.1.
25

Sun, Guangyu, Chao Zhang, Peng Li, Tao Wang, and Yiran Chen. "Statistical Cache Bypassing for Non-Volatile Memory." IEEE Transactions on Computers 65, no. 11 (November 1, 2016): 3427–40. http://dx.doi.org/10.1109/tc.2016.2529621.

26

Frigo, Matteo, and Volker Strumpen. "The memory behavior of cache oblivious stencil computations." Journal of Supercomputing 39, no. 2 (February 21, 2007): 93–112. http://dx.doi.org/10.1007/s11227-007-0111-y.

27

Banu, J. Saira, and M. Rajasekhara Babu. "Exploring Vectorization and Prefetching Techniques on Scientific Kernels and Inferring the Cache Performance Metrics." International Journal of Grid and High Performance Computing 7, no. 2 (April 2015): 18–36. http://dx.doi.org/10.4018/ijghpc.2015040102.

Abstract:
Performance improvement in modern processors is stagnating due to the power wall and memory wall problems. In general, the power wall problem is addressed by various vectorization design techniques, while the memory wall problem is mitigated through prefetching. In this paper, vectorization is achieved through the Single Instruction Multiple Data (SIMD) registers of current processors. They provide architectural optimization by reducing the number of instructions in the pipeline and by minimizing the utilization of the multi-level memory hierarchy. These registers provide an economical computing platform compared to a Graphics Processing Unit (GPU) for compute-intensive applications. This paper explores software prefetching via Streaming SIMD Extensions (SSE) instructions to mitigate the memory wall problem. This work quantifies the effect of vectorization and prefetching in a Matrix Vector Multiplication (MVM) kernel with dense and sparse structure. Both prefetching and vectorization reduce data and instruction cache pressure, thereby improving cache performance. To show the cache performance improvements in the kernel, the Intel VTune Amplifier is used. Finally, experimental results demonstrate a promising performance of the matrix kernel on Intel's Haswell processor. However, effective utilization of SIMD registers remains a programming challenge for developers.
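The combination the paper studies, SSE vectorization plus software prefetching, has roughly the following shape for a dense MVM inner loop. This is a hedged sketch rather than the paper's code: the prefetch distance of 64 floats ahead is an assumed value that would be tuned in practice.

```c
#include <xmmintrin.h>  /* SSE intrinsics, including _mm_prefetch */

/* One row of dense matrix-vector multiplication: returns dot(a, x).
   Four floats are processed per SSE operation, and _mm_prefetch pulls
   upcoming row data toward the L1 cache ahead of its use. */
float dot_sse(const float *a, const float *x, int n) {
    __m128 acc = _mm_setzero_ps();
    int j;
    for (j = 0; j + 4 <= n; j += 4) {
        _mm_prefetch((const char *)&a[j + 64], _MM_HINT_T0); /* fetch ahead */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&a[j]),
                                         _mm_loadu_ps(&x[j])));
    }
    float r[4];
    _mm_storeu_ps(r, acc);
    float sum = r[0] + r[1] + r[2] + r[3];
    for (; j < n; j++)              /* scalar tail when n is not a multiple of 4 */
        sum += a[j] * x[j];
    return sum;
}
```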
28

Dahlgren, F., and P. Stenstrom. "Using Write Caches to Improve Performance of Cache Coherence Protocols in Shared-Memory Multiprocessors." Journal of Parallel and Distributed Computing 26, no. 2 (April 1995): 193–210. http://dx.doi.org/10.1006/jpdc.1995.1059.

29

Petersen, K., and K. Li. "Multiprocessor Cache Coherence Based on Virtual Memory Support." Journal of Parallel and Distributed Computing 29, no. 2 (September 1995): 158–78. http://dx.doi.org/10.1006/jpdc.1995.1115.

30

Lopriore, Lanfranco. "Line fetch/prefetch in a stack cache memory." Microprocessors and Microsystems 17, no. 9 (November 1993): 547–55. http://dx.doi.org/10.1016/s0141-9331(09)91006-7.

31

Torrellas, Josep, Andrew Tucker, and Anoop Gupta. "Benefits of cache-affinity scheduling in shared-memory multiprocessors." ACM SIGMETRICS Performance Evaluation Review 21, no. 1 (June 1993): 272–74. http://dx.doi.org/10.1145/166962.167038.

32

Cruz, Eduardo H. M., Matthias Diener, Laércio L. Pilla, and Philippe O. A. Navaux. "Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit." ACM Transactions on Modeling and Performance Evaluation of Computing Systems 5, no. 4 (March 2021): 1–28. http://dx.doi.org/10.1145/3433687.

Abstract:
Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures have introduced a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. These memory hierarchies can have an impact on the performance and energy efficiency of parallel applications as the importance of memory access locality is increased. In order to improve locality, the analysis of the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses, or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect the memory access behavior in hardware. With this information, the operating system can perform online mapping without any previous knowledge about the behavior of the application. In the evaluation with a wide range of parallel applications (NAS Parallel Benchmarks and PARSEC Benchmark Suite), performance was improved by up to 35.7% (10.0% on average) and energy efficiency was improved by up to 11.9% (4.1% on average). These improvements happened due to a substantial reduction of cache misses and interconnection traffic.
33

Mavriplis, Dimitri J. "Parallel Performance Investigations of an Unstructured Mesh Navier-Stokes Solver." International Journal of High Performance Computing Applications 16, no. 4 (November 2002): 395–407. http://dx.doi.org/10.1177/109434200201600403.

Abstract:
The implementation and performance of a hybrid OpenMP/MPI parallel communication strategy for an unstructured mesh computational fluid dynamics code is described. The solver is cache efficient and fully vectorizable, and is parallelized using a two-level hybrid MPI-OpenMP implementation suitable for shared and/or distributed memory architectures, as well as clusters of shared memory machines. Parallelism is obtained through domain decomposition for both communication models. Single processor computational rates as well as scalability curves are given for various architectures. For the architectures studied in this work, the OpenMP and hybrid OpenMP/MPI communication strategies achieved no appreciable performance benefit over an exclusive MPI communication strategy.
34

Hassan, Muhammad, Chang Hyun Park, and David Black-Schaffer. "A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006." ACM Transactions on Architecture and Code Optimization 18, no. 2 (March 2021): 1–20. http://dx.doi.org/10.1145/3446200.

Abstract:
The SPEC CPU Benchmarks are used extensively for evaluating and comparing improvements to computer systems. This ubiquity makes characterization critical for researchers to understand the bottlenecks the benchmarks do and do not expose and where new designs should and should not be expected to show impact. However, in characterization there is a tradeoff between accuracy and reusability: The more precisely we characterize a benchmark’s performance on a given system, the less usable it is across different micro-architectures and varying memory configurations. For SPEC, most existing characterizations include system-specific effects (e.g., via performance counters) and/or only look at aggregate behavior (e.g., averages over the full application execution). While such approaches simplify characterization, they make it difficult to separate the applications’ intrinsic behavior from the system-specific effects and/or lose the diverse phase-based behaviors. In this work we focus on characterizing the applications’ intrinsic memory behaviour by isolating them from micro-architectural configuration specifics. We do this by providing a simplified generic system model that evaluates the applications’ memory behavior across multiple cache sizes, with and without prefetching, and over time. The resulting characterization can be reused across a range of systems to understand application behavior and allow us to see how frequently different behaviors occur. We use this approach to compare the SPEC 2006 and 2017 suites, providing insight into their memory system behaviour beyond previous system-specific and/or aggregate results. We demonstrate the ability to use this characterization in different contexts by showing a portion of the SPEC 2017 benchmark suite that could benefit from giga-scale caches, despite aggregate results indicating otherwise.
35

Wittenbrink, C. M., A. K. Somani, and Chung-Ho Chen. "Cache write generate for parallel image processing on shared memory architectures." IEEE Transactions on Image Processing 5, no. 7 (July 1996): 1204–8. http://dx.doi.org/10.1109/83.502410.

36

Matsumoto, Akira, Takayuki Nakagawa, Masatoshi Sato, Yasunori Kimura, Kenji Nishida, and Atsuhiro Goto. "Locally parallel cache design based on KL1 memory access characteristics." New Generation Computing 9, no. 2 (June 1991): 149–69. http://dx.doi.org/10.1007/bf03037641.

37

Zhang, Lei, Reza Karimi, Irfan Ahmad, and Ymir Vigfusson. "Optimal Data Placement for Heterogeneous Cache, Memory, and Storage Systems." ACM SIGMETRICS Performance Evaluation Review 48, no. 1 (July 8, 2020): 85–86. http://dx.doi.org/10.1145/3410048.3410098.

38

Jog, Rajeev, Philip L. Vitale, and James R. Callister. "Performance evaluation of a commercial cache-coherent shared memory multiprocessor." ACM SIGMETRICS Performance Evaluation Review 18, no. 1 (April 1990): 173–82. http://dx.doi.org/10.1145/98460.98756.

39

Li, Xiaochang, and Zhengjun Zhai. "UHNVM: A Universal Heterogeneous Cache Design with Non-Volatile Memory." Electronics 10, no. 15 (July 22, 2021): 1760. http://dx.doi.org/10.3390/electronics10151760.

Abstract:
During recent decades, non-volatile memory (NVM) has been anticipated to scale up main memory size, improve application performance, and reduce the speed gap between main memory and storage devices, while supporting persistent storage to cope with power outages. However, to fit NVM, all existing DRAM-based applications have to be rewritten by developers. The developer must therefore have a good understanding of the targeted application code so as to manually distinguish and store the data fit for NVM. In order to intelligently facilitate NVM deployment for existing legacy applications, we propose a universal heterogeneous cache hierarchy which is able to automatically select and store the appropriate application data in non-volatile memory (UHNVM), without compulsory code understanding. In this article, a program context (PC) technique is proposed in user space to help UHNVM classify data. Compared to conventional hot/cold file categorization, the PC technique can categorize application data in a fine-grained manner, enabling us to store it either in NVM or on SSDs efficiently for better performance. Our experimental results using a real Optane dual-inline-memory-module (DIMM) card show that our new heterogeneous architecture reduces elapsed times by about 11% compared to a conventional kernel memory configuration without NVM.
40

Rajan, Mahesh, Douglas Doerfler, Courtenay T. Vaughan, Marcus Epperson, and Jeff Ogden. "Application Performance on the Tri-Lab Linux Capacity Cluster - TLCC." International Journal of Distributed Systems and Technologies 1, no. 2 (April 2010): 23–39. http://dx.doi.org/10.4018/jdst.2010040102.

Abstract:
In a recent acquisition by DOE/NNSA, several large capacity computing clusters called TLCC have been installed at the DOE labs: SNL, LANL, and LLNL. The TLCC architecture, with ccNUMA, multi-socket, multi-core nodes and an InfiniBand interconnect, is representative of the trend in HPC architectures. This paper examines application performance on TLCC, contrasting it with Red Storm/Cray XT4. TLCC and Red Storm share similar AMD processors and memory DIMMs; Red Storm, however, has single-socket nodes and a custom interconnect. Micro-benchmarks and performance analysis tools help understand the causes of the observed performance differences. Controlling processor and memory affinity on TLCC with the numactl utility is shown to result in significant performance gains and is essential to attenuate the detrimental impact of OS interference and cache-coherency overhead. While previous studies have investigated the impact of affinity control mostly in the context of small SMP systems, the focus of this paper is on highly parallel MPI applications.
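numactl is a command-line tool, so the affinity control described above is applied from outside the program (for example, numactl --cpunodebind=0 --membind=0 ./app). The same effect can be requested from inside a program through libnuma; the sketch below assumes a Linux system with libnuma installed and is illustrative rather than taken from the paper.

```c
#include <numa.h>   /* libnuma; link with -lnuma */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy is not available on this system\n");
        return 1;
    }
    numa_run_on_node(0);    /* restrict this task's threads to node 0's CPUs */
    numa_set_preferred(0);  /* prefer allocations from node 0's local memory */
    /* ... run the memory-sensitive workload here ... */
    return 0;
}
```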
41

Duarte, Filipa, and Stephan Wong. "Cache-Based Memory Copy Hardware Accelerator for Multicore Systems." IEEE Transactions on Computers 59, no. 11 (November 2010): 1494–507. http://dx.doi.org/10.1109/tc.2010.41.

42

Miyachi, Taizo, Akitoshi Mitsuishi, and Tetsuo Mizoguchi. "Performance Evaluation for Memory Subsystem of Hierarchical Disk-Cache." Systems and Computers in Japan 17, no. 7 (1986): 86–94. http://dx.doi.org/10.1002/scj.4690170710.

43

Graf, Susanne. "Characterization of a sequentially consistent memory and verification of a cache memory by abstraction." Distributed Computing 12, no. 2-3 (May 1, 1999): 75–90. http://dx.doi.org/10.1007/s004460050059.

44

Bethel, E. Wes, and Mark Howison. "Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning." International Journal of High Performance Computing Applications 26, no. 4 (April 3, 2012): 399–412. http://dx.doi.org/10.1177/1094342012440466.

Abstract:
Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.
45

Esakkimuthu, G., H. S. Kim, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. "Investigating Memory System Energy Behavior Using Software and Hardware Optimizations." VLSI Design 12, no. 2 (January 1, 2001): 151–65. http://dx.doi.org/10.1155/2001/70310.

Abstract:
The memory system usually consumes a significant amount of energy in many battery-operated devices. In this paper, we provide a quantitative comparison and evaluation of the interaction of two hardware cache optimization mechanisms and three widely used compiler optimization techniques used to reduce memory system energy. Our presentation is in two parts. First, we focus on a set of memory-intensive benchmark codes and investigate their memory system energy behavior due to data accesses under hardware and compiler optimizations. Then, using four motion estimation codes, we look at the influence of compiler optimizations on memory system energy, considering the overall impact of instruction and data accesses.
46

Torrellas, J., A. Tucker, and A. Gupta. "Evaluating the Performance of Cache-Affinity Scheduling in Shared-Memory Multiprocessors." Journal of Parallel and Distributed Computing 24, no. 2 (February 1995): 139–51. http://dx.doi.org/10.1006/jpdc.1995.1014.

47

Poulsen, David K., and Pen-Chung Yew. "Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors." Journal of Parallel and Distributed Computing 33, no. 2 (March 1996): 172–88. http://dx.doi.org/10.1006/jpdc.1996.0036.

48

Cheng, Ching-Hwa. "Design Example of Useful Memory Latency for Developing a Hazard Preventive Pipeline High-Performance Embedded-Microprocessor." VLSI Design 2013 (July 22, 2013): 1–10. http://dx.doi.org/10.1155/2013/425105.

Abstract:
The existence of structural, control, and data hazards presents a major challenge in designing an advanced pipelined/superscalar microprocessor. An efficient cache-RAM-disk memory hierarchy greatly enhances the microprocessor's performance. However, there are complex relationships between the memory hierarchy and the functional units in the microprocessor. Most past architectural design simulations focus on the instruction hazard detection/prevention scheme from the viewpoint of the functional units. This paper emphasizes that additional on-board memory can be well utilized to handle hazard conditions. When an instruction encounters a hazard, the memory latency can be utilized to offset the performance degradation caused by the hazard prevention mechanism. By using the proposed technique, a better architectural design can be rapidly validated by an FPGA at the start of the design stage. In this paper, the simulation results prove that our proposed methodology has better performance and lower power consumption compared to the conventional hazard prevention technique.
49

Giraud, L. "Combining Shared and Distributed Memory Programming Models on Clusters of Symmetric Multiprocessors: Some Basic Promising Experiments." International Journal of High Performance Computing Applications 16, no. 4 (November 2002): 425–30. http://dx.doi.org/10.1177/109434200201600405.

Abstract:
This note presents experiments on different clusters of SMPs, where the distributed and shared memory parallel programming paradigms can be naturally combined. Although the platforms exhibit the same macroscopic memory organization, it appears that their overall performance depends closely on the ability of their hardware to efficiently exploit the local shared memory within the nodes. In that context, a cache blocking strategy appears to be very important, not only for getting good performance out of each individual processor but chiefly for getting good performance out of the overall computing node, since sharing memory locally can become a severe bottleneck. On a very simple benchmark, representative of many large simulation codes, we show through numerical experiments that mixing the two programming models yields attractive speed-ups that compete with a pure distributed memory approach. This opens promising perspectives for smoothly moving large industrial codes, developed on distributed vector computers with a moderate number of processors, onto the clusters of SMPs emerging as platforms for intensive scientific computing.
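The combined model the note describes typically places one MPI rank per SMP node and spawns OpenMP threads within it: message passing between nodes, shared memory inside them. The C sketch below is a generic illustration of that structure, not the note's benchmark.

```c
#include <mpi.h>
#include <stdio.h>

/* Hybrid model: one MPI rank per SMP node, OpenMP threads inside the node. */
int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* Shared memory parallelism within the node. */
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + (double)i);

    /* Distributed memory parallelism across nodes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```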
50

Gong, Young-Ho. "Monolithic 3D-Based SRAM/MRAM Hybrid Memory for an Energy-Efficient Unified L2 TLB-Cache Architecture." IEEE Access 9 (2021): 18915–26. http://dx.doi.org/10.1109/access.2021.3054021.
