Dissertations / Theses: 'Computer Memory Architecture'

1

Scrbak, Marko. "Methodical Evaluation of Processing-in-Memory Alternatives." Thesis, University of North Texas, 2019. https://digital.library.unt.edu/ark:/67531/metadc1505199/.

Full text

Abstract:

In this work, I characterized a series of potential application kernels using a set of architectural and non-architectural metrics, and performed a comparison of four different alternatives for processing-in-memory cores (PIMs): ARM cores, GPGPUs, coarse-grained reconfigurable dataflow (DF-PIM), and a domain specific architecture using SIMD PIM engine consisting of a series of multiply-accumulate circuits (MACs). For each PIM alternative I investigated how performance and energy efficiency changes with respect to a series of system parameters, such as memory bandwidth and latency, number of PIM cores, DVFS states, cache architecture, etc. In addition, I compared the PIM core choices for a subset of applications and discussed how the application characteristics correlate to the achieved performance and energy efficiency. Furthermore, I compared the PIM alternatives to a host-centric solution that uses a traditional server-class CPU core or PIM-like cores acting as host-side accelerators instead of being part of 3D-stacked memories. Such insights can expose the achievable performance limits and shortcomings of certain PIM designs and show sensitivity to a series of system parameters (available memory bandwidth, application latency and bandwidth sensitivity, etc.). In addition, identifying the common application characteristics for PIM kernels provides opportunity to identify similar types of computation patterns in other applications and allows us to create a set of applications which can then be used as benchmarks for evaluating future PIM design alternatives.

APA, Harvard, Vancouver, ISO, and other styles

2

Lee, Joonwon. "Architectural features for Scalable shared memory multiprocessors." Diss., Georgia Institute of Technology, 1991. http://hdl.handle.net/1853/8200.

Full text

APA, Harvard, Vancouver, ISO, and other styles

3

Rixner, Scott. "Memory system architecture for real-time multitasking systems." Thesis, Massachusetts Institute of Technology, 1995. http://hdl.handle.net/1721.1/36599.

Full text

APA, Harvard, Vancouver, ISO, and other styles

4

Rankin, Linda J. "A dual-ported real memory architecture for the g-machine." Full text open access at:, 1986. http://content.ohsu.edu/u?/etd,117.

Full text

APA, Harvard, Vancouver, ISO, and other styles

5

Chi, Hsiang. "Flash memory boot block architecture for safe firmware updates." FIU Digital Commons, 1995. http://digitalcommons.fiu.edu/etd/2160.

Full text

Abstract:

The most significant risk of updating embedded system code is the possible loss of system firmware during the update process. If the firmware is lost, the system will cease to operate, which can be very costly to the end user. This thesis is concerned with exploring alternate architectures which exploit the integration of flash memory technology in order to overcome this problem. Three design models and associated software techniques will be presented. These design models are described in detail in terms of the strategies they employ in order to prevent system lockup and the loss of firmware. The most important objective, which is addressed in the third model, is to ensure that the system can continue to process interrupts during the update. In addition, a portion of this research was aimed at providing the capability to perform updates remotely, and at maximizing system code memory space and available system RAM.

APA, Harvard, Vancouver, ISO, and other styles

6

Khurana, Harneet Singh. "Memory and data communication link architecture for micro-implants." Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/57686.

Full text

Abstract:

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 89).
With the growing need for the development of smaller implantable monitors, alternative energy storage sources such as high density ultra capacitors are envisioned to replace the bulky batteries in these devices. Ultracapacitors have the potential to be integrated on a silicon wafer, and have the benefits of an unlimited number of recharge cycles and extremely rapid recharging times. However, they present an essential challenge in that the voltage drops rapidly with energy drain. In this thesis, we explore a data storage memory that is compatible with ultracapacitor energy storage. In addition, we propose and demonstrate a low-power wireless link that exploits RFID techniques as a way of uploading the stored data.
by Harneet Singh Khurana.
S.M.

APA, Harvard, Vancouver, ISO, and other styles

7

Mui, Eric Y. (Eric Yeeming) 1976. "Optimizing memory access for the Architecture Exploration System (ARIES)." Thesis, Massachusetts Institute of Technology, 2000. http://hdl.handle.net/1721.1/86536.

Full text

APA, Harvard, Vancouver, ISO, and other styles

8

Kamolpornwijit, Witchakorn. "P-TAXI : enforcing memory safety with programmable tagged architecture." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/105996.

Full text

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 104-112).
Buffer overflow is a well-known problem that remains a threat to software security. With the advancement of code-reuse attacks and return-oriented programming (ROP), it becomes problematic to protect a program from being compromised. Several defenses have been developed in an attempt to defeat code-reuse attacks. However, there is still no solution that provides complete protection with low overhead. In this thesis, we improved TAXI, a ROP defense technique that utilizes a tagged architecture to prevent memory violations. Inspired by Programmable Unit for Metadata Processing (PUMP), we modified TAXI so that enforcement policies can be programmed by user-level code and called it P-TAXI (Programmable TAXI). We demonstrated that, by using P-TAXI, we were able to enforce memory safety policies, including return address protection, stack garbage collection, and memory compartmentalization. In addition, we showed that P-TAXI can be used for debugging and taint tracking.
by Witchakorn Kamolpornwijit.
M. Eng.

APA, Harvard, Vancouver, ISO, and other styles

9

Ainsworth, Sam. "Prefetching for complex memory access patterns." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/277804.

Full text

Abstract:

Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernable pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed. This thesis makes three contributions. I first contribute an automated software prefetching compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3x geometric mean speedup on the most complex processors, and 2.7x on the simplest. I also provide an analysis of when and why this is likely to be successful, which data structures to target, and how to schedule software prefetches well. Then I introduce a hardware solution, the configurable graph prefetcher. This uses the example of breadth-first search on graph workloads to motivate how a hardware prefetcher armed with data-structure knowledge can avoid the instruction overheads, inflexibility and limited latency tolerance of software prefetching. The configurable graph prefetcher sits at the L1 cache and observes memory accesses, which can be configured by a programmer to be aware of a limited number of different data access patterns, achieving 2.3x geometric mean speedup on graph workloads on an out-of-order core. My final contribution extends the hardware used for the configurable graph prefetcher to make an event-triggered programmable prefetcher, using a set of a set of very small micro-controller-sized programmable prefetch units (PPUs) to cover a wide set of workloads. I do this by developing a highly parallel programming model that can be used to issue prefetches, thus allowing high-throughput prefetching with low power and area overheads of only around 3%, and a 3x geometric mean speedup for a variety of memory-bound applications. To facilitate its use, I then develop compiler techniques to help automate the process of targeting the programmable prefetcher. These provide a variety of tradeoffs from easiest to use to best performance.

APA, Harvard, Vancouver, ISO, and other styles

10

Li, Wentong Kavi Krishna M. "High performance architecture using speculative threads and dynamic memory management hardware." [Denton, Tex.] : University of North Texas, 2007. http://digital.library.unt.edu/permalink/meta-dc-5150.

Full text

APA, Harvard, Vancouver, ISO, and other styles

11

Butler, Bryan P. (Bryan Philip). "A fault-tolerant shared memory system architecture for a Byzantine resilient computer." Thesis, Massachusetts Institute of Technology, 1989. http://hdl.handle.net/1721.1/13360.

Full text

Abstract:

Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1989.
Includes bibliographical references (leaves 145-147).
by Bryan P. Butler.
M.S.

APA, Harvard, Vancouver, ISO, and other styles

12

Beane, Glen L. "The Effects of Microprocessor Architecture on Speedup in Distrbuted Memory Supercomputers." Fogler Library, University of Maine, 2004. http://www.library.umaine.edu/theses/pdf/BeaneGL2004.pdf.

Full text

APA, Harvard, Vancouver, ISO, and other styles

13

Ballapuram, Chinnakrishnan S. "Semantics-oriented low power architecture." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/22567.

Full text

Abstract:

Thesis (Ph. D.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2008.
Committee Chair: Hsien-Hsin Sean Lee; Committee Member: Abhijit Chatterjee; Committee Member: Bernard Kippelen; Committee Member: Gabriel H. Loh; Committee Member: SungKyu Lim.

APA, Harvard, Vancouver, ISO, and other styles

14

Shim, Keun Sup. "Directoryless shared memory architecture using thread migration and remote access." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/90001.

Full text

Abstract:

Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 101-106).
Chip multiprocessors (CMPs) have become mainstream in recent years, and, for scalability reasons, high-core-count designs tend towards tiled CMPs with physically distributed caches. In order to support shared memory, current many-core CMPs maintain cache coherence using distributed directory protocols, which are extremely difficult and error-prone to implement and verify. Private caches with directory-based coherence also provide suboptimal performance when a thread accesses large amounts of data distributed across the chip: the data must be brought to the core where the thread is running, incurring delays and energy costs. Under this scenario, migrating a thread to data instead of the other way around can improve performance. In this thesis, we propose a directoryless approach where data can be accessed either via a round-trip remote access protocol or by migrating a thread to where data resides. While our hardware mechanism for fine-grained thread migration enables faster migration than previous proposals, its costs still make it crucial to use thread migrations judiciously for the performance of our proposed architecture. We, therefore, present an on-line algorithm which decides at the instruction level whether to perform a remote access or a thread migration. In addition, to further reduce migration costs, we extend our scheme to support partial context migration by predicting the necessary thread context. Finally, we provide the ASIC implementation details as well as RTL simulation results of the Execution Migration Machine (EM² ), a 110-core directoryless shared-memory processor.
by Keun Sup Shim.
Ph. D.

APA, Harvard, Vancouver, ISO, and other styles

15

Jakab, Levente 1981. "A shared-memory multiprocessor system using the raw tiled architecture." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28394.

Full text

Abstract:

Thesis (M. Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (p. 203-208).
Recent trends in microprocessor design have moved away from dedicated hardware mechanisms to exposed architectures in which basic functionality is implemented in software. To demonstrate the flexibility of this scheme, I implement a shared memory system on Raw, a tiled multiprocessor. A traditional directory-based cache coherence system is used. The directories are fully resident on several tiles, and the remaining tiles act as users, with cache maintenance routines accessed via interrupt. Previous implementations of shared-memory systems have mostly relied on a combination of dedicated hardware and user-enabled software hooks, with one notable exception, Alewife, combining basic hardware with software support for corner cases. Here, the focus is moved to placing as much support for basic cases into software as possible. The system is designed to minimise custom hardware and user responsibilities. I prove the feasibility of such a design on an exposed architecture such as Raw.
by Levente Jakab.
M.Eng.and S.B.

APA, Harvard, Vancouver, ISO, and other styles

16

Janapsatya, Andhi Computer Science &amp Engineering Faculty of Engineering UNSW. "Optimization of instruction memory for embedded systems." Awarded by:University of New South Wales. School of Computer Science and Engineering, 2005. http://handle.unsw.edu.au/1959.4/24210.

Full text

Abstract:

This thesis presents methodologies for improving system performance and energy consumption by optimizing the memory hierarchy performance. The processor-memory performance gap is a well-known problem that is predicted to get worse, as the performance gap between processor and memory is widening. The author describes a method to estimate the best L1 cache configuration for a given application. In addition, three methods are presented to improve the performance and reduce energy in embedded systems by optimizing the instruction memory. Performance estimation is an important procedure to assess the performance of the system and to assess the effectiveness of any applied optimizations. A cache memory performance estimation methodology is presented in this thesis. The methodology is designed to quickly and accurately estimate the performance of multiple cache memory configurations. Experimental results showed that the methodology is on average 45 times faster compared to a widely used tool (Dinero IV). The first optimization method is a software-only method, called code placement, was implemented to improve the performance of instruction cache memory. The method involves careful placement of code within memory to ensure high cache hit rate when code is brought into the cache memory. Code placement methodology aims to improve cache hit rates to improve cache memory performance. Experimental results show that by applying the code placement method, a reduction in cache miss rate by up to 71%, and energy consumption reduction of up to 63% are observed when compared to application without code placement. The second method involves a novel architecture for utilizing scratchpad memory. The scratchpad memory is designed as a replacement of the instruction cache memory. Hardware modification was designed to allow data to be written into the scratchpad memory during program execution, allowing dynamic control of the scratchpad memory content. Scratchpad memory has a faster memory access time and a lower energy consumption per access compared to cache memory; the usage of scratchpad memory aims to improve performance and lower energy consumption of systems compared to system with cache memory. Experimental results show an average energy reduction of 26.59% and an average performance improvement of 25.63% when compared to a system with cache memory. The third is an application profiling method using statistical information to identify application???s hot-spots. Application profiling is important for identifying section in the application where performance degradation might occur and/or where maximum performance gain can be obtained through optimization. The method was applied and tested on the scratchpad based system described in this thesis. Experimental results show the effectiveness of the analysis method in reducing energy and improving performance when compared to previous method for utilizing the scratchpad memory based system (average performance improvement of 23.6% and average energy reduction of 27.1% are observed).

APA, Harvard, Vancouver, ISO, and other styles

17

Narravula, Harsha V. Katsinis Constantine. "Performance of parallel algorithms on a broadcast-based architecture /." Philadelphia, Pa. : Drexel University, 2003. http://dspace.library.drexel.edu/handle/1860/254.

Full text

APA, Harvard, Vancouver, ISO, and other styles

18

Li, Wentong. "High Performance Architecture using Speculative Threads and Dynamic Memory Management Hardware." Thesis, University of North Texas, 2007. https://digital.library.unt.edu/ark:/67531/metadc5150/.

Full text

Abstract:

With the advances in very large scale integration (VLSI) technology, hundreds of billions of transistors can be packed into a single chip. With the increased hardware budget, how to take advantage of available hardware resources becomes an important research area. Some researchers have shifted from control flow Von-Neumann architecture back to dataflow architecture again in order to explore scalable architectures leading to multi-core systems with several hundreds of processing elements. In this dissertation, I address how the performance of modern processing systems can be improved, while attempting to reduce hardware complexity and energy consumptions. My research described here tackles both central processing unit (CPU) performance and memory subsystem performance. More specifically I will describe my research related to the design of an innovative decoupled multithreaded architecture that can be used in multi-core processor implementations. I also address how memory management functions can be off-loaded from processing pipelines to further improve system performance and eliminate cache pollution caused by runtime management functions.

APA, Harvard, Vancouver, ISO, and other styles

19

Bani, Ruchi Rastogi Mohanty Saraju. "A new N-way reconfigurable data cache architecture for embedded systems." [Denton, Tex.] : University of North Texas, 2009. http://digital.library.unt.edu/ark:/67531/metadc12079.

Full text

APA, Harvard, Vancouver, ISO, and other styles

20

Rosello, Oscar (Rosello Gil). "NeverMind : an interface for human Memory augmentation." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/111494.

Full text

Abstract:

Thesis: S.M. in Architecture Studies - Design and Computation, Massachusetts Institute of Technology, Department of Architecture, 2017.
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis. "June 2017."
Includes bibliographical references (pages [67]-70).
If we are to understand human-level intelligence, we need to understand how memories are encoded, stored and retrieved. In this thesis, I take a step towards that understanding by focusing on a high-level interpretation of the relationship between episodic memory formation and spatial navigation. On the basis of the biologically inspired process, I focus on the implementation of NeverMind, an augmented reality (AR) interface designed to help people memorize effectively. Early experiments conducted with a prototype of NeverMind suggest that the long-term memory recall accuracy of sequences of items is nearly tripled compared to paper-based memorization tasks. For this thesis, I suggest that we can trigger episodic memory for tasks that we normally associate with semantic memory, by using interfaces to passively stimulate the hippocampus, the entorhinal cortex, and the neocortex. Inspired by the methods currently used by memory champions, NeverMind facilitates memory encoding by engaging in hippocampal activation and promoting task-specific neural firing. NeverMind pairs spatial navigation with visual cues to make memorization tasks effective and enjoyable. The contributions of this thesis are twofold: first, I developed NeverMind, a tool to facilitate memorization through a single exposure by biasing our minds into using episodic memory. When studying, we tend to use semantic memory and encoding through repetition; however, by using augmented reality interfaces we can manipulate how our brain encodes information and memorize long term content with a single exposure, making a memory champion technique accessible to anyone. Second, I provide an open-source platform for researchers to conduct high-level experiments on episodic memory and spatial navigation. In this thesis I suggest that digital user interfaces can be used as a tool to gather insights on how human memory works.
by Oscar Rosello.
S.M. in Architecture Studies - Design and Computation
S.M.

APA, Harvard, Vancouver, ISO, and other styles

21

Ghosh, Mrinmoy. "Microarchitectural techniques to reduce energy consumption in the memory hierarchy." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/28265.

Full text

Abstract:

Thesis (M. S.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhakar.

APA, Harvard, Vancouver, ISO, and other styles

22

Gilgeous, Latoya Tabita. "An integrated software/hardware approach to detecting memory bounds violations." Diss., Online access via UMI:, 2007.

Find full text

Abstract:

Thesis (M.S.)--State University of New York at Binghamton, Thomas J. Watson School of Engineering and Applied Science, Department of Electrical and Computer Engineering, 2007.
Includes bibliographical references.

APA, Harvard, Vancouver, ISO, and other styles

23

Sim, Jae Woong. "Architecting heterogeneous memory systems with 3D die-stacked memory." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/53835.

Full text

Abstract:

The main objective of this research is to efficiently enable 3D die-stacked memory and heterogeneous memory systems. 3D die-stacking is an emerging technology that allows for large amounts of in-package high-bandwidth memory storage. Die-stacked memory has the potential to provide extraordinary performance and energy benefits for computing environments, from data-intensive to mobile computing. However, incorporating die-stacked memory into computing environments requires innovations across the system stack from hardware and software. This dissertation presents several architectural innovations to practically deploy die-stacked memory into a variety of computing systems. First, this dissertation proposes using die-stacked DRAM as a hardware-managed cache in a practical and efficient way. The proposed DRAM cache architecture employs two novel techniques: hit-miss speculation and self-balancing dispatch. The proposed techniques virtually eliminate the hardware overhead of maintaining a multi-megabytes SRAM structure, when scaling to gigabytes of stacked DRAM caches, and improve overall memory bandwidth utilization. Second, this dissertation proposes a DRAM cache organization that provides a high level of reliability for die-stacked DRAM caches in a cost-effective manner. The proposed DRAM cache uses error-correcting code (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures—from traditional bit errors to large-scale row, column, bank, and channel failures—within the constraints of commodity, non-ECC DRAM stacks. With only a modest performance degradation compared to a DRAM cache with no ECC support, the proposed organization can correct all single-bit failures, and 99.9993% of all row, column, and bank failures. Third, this dissertation proposes architectural mechanisms to use large, fast, on-chip memory structures as part of memory (PoM) seamlessly through the hardware. The proposed design achieves the performance benefit of on-chip memory caches without sacrificing a large fraction of total memory capacity to serve as a cache. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Lastly, this dissertation explores a new usage model for die-stacked DRAM involving a hybrid of caching and virtual memory support. In the common case where system’s physical memory is not over-committed, die-stacked DRAM operates as a cache to provide performance and energy benefits to the system. However, when the workload’s active memory demands exceed the capacity of the physical memory, the proposed scheme dynamically converts the stacked DRAM cache into a fast swap device to avoid the otherwise grievous performance penalty of swapping to disk.

APA, Harvard, Vancouver, ISO, and other styles

24

Bani, Ruchi Rastogi. "A New N-way Reconfigurable Data Cache Architecture for Embedded Systems." Thesis, University of North Texas, 2009. https://digital.library.unt.edu/ark:/67531/metadc12079/.

Full text

Abstract:

Performance and power consumption are most important issues while designing embedded systems. Several studies have shown that cache memory consumes about 50% of the total power in these systems. Thus, the architecture of the cache governs both performance and power usage of embedded systems. A new N-way reconfigurable data cache is proposed especially for embedded systems. This thesis explores the issues and design considerations involved in designing a reconfigurable cache. The proposed reconfigurable data cache architecture can be configured as direct-mapped, two-way, or four-way set associative using a mode selector. The module has been designed and simulated in Xilinx ISE 9.1i and ModelSim SE 6.3e using the Verilog hardware description language.

APA, Harvard, Vancouver, ISO, and other styles

25

Panwar, Gagandeep. "Towards Using Free Memory to Improve Microarchitecture Performance." Thesis, Virginia Tech, 2020. http://hdl.handle.net/10919/98745.

Full text

Abstract:

A computer system's memory is designed to accommodate the worst-case workloads with the highest memory requirement; as such, memory is underutilized when a system runs workloads with common-case memory requirements. Through a large-scale study of four production HPC systems, we find that memory underutilization problem in HPC systems is very severe. As unused memory is wasted memory, we propose exposing a compute node's unused memory to its CPU(s) through a user-transparent CPU-OS codesign. This can enable many new microarchitecture techniques that transparently leverage unused memory locations to help improve microarchitecture performance. We refer to these techniques as Free-memory-aware Microarchitecture Techniques (FMTs). In the context of HPC systems, we present a detailed example of an FMT called Free-memory-aware Replication (FMR). FMR replicates in-use data to unused memory locations to effectively reduce average memory read latency. On average across five HPC benchmark suites, FMR provides 13% performance and 8% system-level energy improvement.
M.S.
Random-access memory (RAM) or simply memory, stores the temporary data of applications that run on a computer system. Its size is determined by the worst-case application workload that the computer system is supposed to run. Through our memory utilization study of four large multi-node high-performance computing (HPC) systems, we find that memory is underutilized severely in these systems. Unused memory is a wasted resource that does nothing. In this work, we propose techniques that can make use of this wasted memory to boost computer system performance. We call these techniques Free-memory-aware Microarchitecture Techniques (FMTs). We then present an FMT for HPC systems in detail called Free-memory-aware Replication (FMR) that provides performance improvement of over 13%.

APA, Harvard, Vancouver, ISO, and other styles

26

Sorenson, Elizabeth S. "Cache characterization and performance studies using locality surfaces /." Diss., CLICK HERE for online access, 2005. http://contentdm.lib.byu.edu/ETD/image/etd950.pdf.

Full text

APA, Harvard, Vancouver, ISO, and other styles

27

Kim, Donglok. "Extended data cache prefetching using a reference prediction table /." Thesis, Connect to this title online; UW restricted, 1997. http://hdl.handle.net/1773/6127.

Full text

APA, Harvard, Vancouver, ISO, and other styles

28

Harvard, Qawi IbnZayd. "Wide I/O DRAM architecture utilizing proximity communication." [Boise, Idaho] : Boise State University, 2009. http://scholarworks.boisestate.edu/td/72/.

Full text

APA, Harvard, Vancouver, ISO, and other styles

29

Kjelso, Morten. "A quantitative evaluation of data compression in the memory hierarchy." Thesis, Loughborough University, 1997. https://dspace.lboro.ac.uk/2134/10596.

Full text

Abstract:

This thesis explores the use of lossless data compression in the memory hierarchy of contemporary computer systems. Data compression may realise performance benefits by increasing the capacity of a level in the memory hierarchy and by improving the bandwidth between two levels in the memory hierarchy. Lossless data compression is already widely used in parts ofthe memory hierarchy. However, most of these applications are characterised by targeting inexpensive and relatively low performance devices such as magnetic disk and tape devices. The consequences of this are that the benefits of data compression are not realised to their full potential. This research aims to understand how the benefits of data compression can be realised for levels of the memory hierarchy which have a greater impact on system performance and system cost. This thesis presents a review of data compression in the memory hierarchy and argues that main memory compression has the greatest potential to improve system performance. The review also identifies three key issues relating to the use of data compression in the memory hierarchy. Quantitative investigations are presented to address these issues for main memory data compression. The first investigation is into memory data, and shows that memory data from a range of Unix applications typically compresses to half its original size. The second investigation develops three memory compression architectures, taking into account the results of the previous investigation. Furthermore, the management of compressed data is addressed and management methods are developed which achieve storage efficiencies in excess of 90% and typically complete allocation and de allocation operations with only a few memory accesses. The experimental work then culminates in a performance investigation. This shows that when memory resources are strecthed, hardware based memory compression can improve system performance by up to an order of magnitude. Furthermore, software based memory compression can improve system performance by up to a factor of 2. Finally, the performance models and quantitative results contained in this thesis enable us to identify under what conditions memory compression offers performance benefits. This may help designers incorporate memory compression into future computer systems.

APA, Harvard, Vancouver, ISO, and other styles

30

Russ, Samuel H. "An information-theoretic approach to analysis of computer architectures and compression of instruction memory usage." Diss., Georgia Institute of Technology, 1991. http://hdl.handle.net/1853/13357.

Full text

APA, Harvard, Vancouver, ISO, and other styles

31

Beg, Azam Muhammad. "Improving instruction fetch rate with code pattern cache for superscalar architecture." Diss., Mississippi State : Mississippi State University, 2005. http://library.msstate.edu/etd/show.asp?etd=etd-06202005-103032.

Full text

APA, Harvard, Vancouver, ISO, and other styles

32

Zhang, Xiushan. "L2 cache replacement based on inter-access time per access count prediction." Diss., Online access via UMI:, 2009.

Find full text

APA, Harvard, Vancouver, ISO, and other styles

33

Gieske, Edmund Joseph. "Critical Words Cache Memory." University of Cincinnati / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1208368190.

Full text

APA, Harvard, Vancouver, ISO, and other styles

34

Chowdhury, Mohammad Ataur Rahman. "Rethinking the I/O Stack for Persistent Memory." FIU Digital Commons, 2018. https://digitalcommons.fiu.edu/etd/3572.

Full text

Abstract:

Modern operating systems have been designed around the hypotheses that (a) memory is both byte-addressable and volatile and (b) storage is block addressable and persistent. The arrival of new Persistent Memory (PM) technologies, has made these assumptions obsolete. Despite much of the recent work in this space, the need for consistently sharing PM data across multiple applications remains an urgent, unsolved problem. Furthermore, the availability of simple yet powerful operating system support remains elusive. In this dissertation, we propose and build The Region System – a high-performance operating system stack for PM that implements usable consistency and persistence for application data. The region system provides support for consistently mapping and sharing data resident in PM across user application address spaces. The region system creates a novel IPI based PMSYNC operation, which ensures atomic persistence of mapped pages across multiple address spaces. This allows applications to consume PM using the well understood and much desired memory like model with an easy-to-use interface. Next, we propose a metadata structure without any redundant metadata to reduce CPU cache flushes. The high-performance design minimizes the expensive PM ordering and durability operations by embracing a minimalistic approach to metadata construction and management. To strengthen the case for the region system, in this dissertation, we analyze different types of applications to identify their dependence on memory mapped data usage, and propose user level libraries LIBPM-R and LIBPMEMOBJ-R to support shared persistent containers. The user level libraries along with the region system demonstrate a comprehensive end-to-end software stack for consuming the PM devices.

APA, Harvard, Vancouver, ISO, and other styles

35

Herath, Herath Mudiyanselage Isuru Prasenajith. "Data centric and adaptive source changing transactional memory with exit functionality." Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/data-centric-and-adaptive-source-changing-transactional-memory-with-exit-functionality(b304862a-3c88-4d79-9b5f-0f7b99b7cebc).html.

Full text

Abstract:

Multi-core computing is becoming ubiquitous due to the scaling limitations of single-core computing. It is inevitable that parallel programming will become the mainstream for such processors. In this paradigm shift, the concept of abstraction should not be compromised. A programming model serves as an abstraction of how programs are executed. Transactional Memory (TM) is a technique proposed to maintain lock free synchronization. Due to the simplicity of the abstraction provided by it, TM can also be used as a way of distributing parallel work, maintaining coherence and consistency. Motivated by this, at a higher level, the thesis makes three contributions and all are centred around Hardware Transactional Memory (HTM).As the first contribution, a transaction-only architecture is coupled with a ``data centric" approach, to address the scalability issues of the former whilst maintaining its simplicity. This is achieved by grouping together memory locations having similar access patterns and maintaining coherence and consistency according to the group each memory location belongs to. As the second contribution a novel technique is proposed to reduce the number of false transaction aborts which occur in a signature based HTM. The idea is to adaptively switch between cache lines and signatures to detect conflicts. That is, when a transaction fits in the L1 cache, cache line information is used to detect conflicts and signatures are used otherwise. As the third contribution, the thesis makes a case for having an exit functionality in an HTM. The objective of the proposed functionality, TM_EXIT, is to terminate a transaction without restarting or committing.

APA, Harvard, Vancouver, ISO, and other styles

36

Nayyar, Raman. "Performance Analysis of a Hierarchical, Cache-Coherent, Shared Memory Based, Multi-processor System." PDXScholar, 1993. https://pdxscholar.library.pdx.edu/open_access_etds/4695.

Full text

Abstract:

We have conducted a performance analysis of a large scale multiprocessor system based on shared buses organized in a hierarchical fashion and employing an easy to implement snoopy cache protocol. · This arrangement, named TREEBUS [ 5], presents a logical extension path for multiprocessor systems based on a single shared bus whose scalability is limited by the available system bus bandwidth [26]. The multiple, independent, hierarchical buses overcome the bus bandwidth limitation and the architecture can scale to relatively large sizes. We have developed an easy to use, reasonably accurate and computationally efficient analytic model for analyzing the performance of the memory hierarchy. Our analysis presents a balanced view by incorporating cost and size of the memory subsystem, two parameters which can significantly impact the feasibility of this architecture. The results indicate that the TREEBUS can deliver high performance for a maximum of about 512 processors using available technology. For larger sizes, the problem is not the limited system bus bandwidth but the unmanageable size of the main memory and a deteriorating cost/performance ratio.

APA, Harvard, Vancouver, ISO, and other styles

37

Subramaniam, Samantika. "Improving processor efficiency by exploiting common-case behaviors of memory instructions." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/28165.

Full text

Abstract:

Thesis (M. S.)--Computing, Georgia Institute of Technology, 2009.
Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milos.

APA, Harvard, Vancouver, ISO, and other styles

38

SOHONI, SOHUM. "IMPROVING L2 CACHE PERFORMANCE THROUGH STREAM-DIRECTED OPTIMIZATIONS." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092932892.

Full text

APA, Harvard, Vancouver, ISO, and other styles

39

Hafner, William L. "Requirements and Architecture for a Group Memory in the Any-Time/Any-Place Domain of Computer Supported Cooperative Work." NSUWorks, 1999. http://nsuworks.nova.edu/gscis_etd/560.

Full text

Abstract:

When workers are separated in time and space, but have to cooperate in solving a business problem, there is a need for computer-based support to assist the group in remembering group decisions, past activities, current assignments and on-going conversations about the work in progress. Computer technology can provide a repository that allows group members to record and retrieve this type of information from a central system. This dissertation has focused on the system requirements for this type of group memory in the Any-Time/Any-Place domain of computer-supported cooperative work. A set of requirements for a central system to support distributed group work has been developed. The consolidated requirements have been documented in a requirements specification. This specification has refined the system designs presented in the published literature. The requirements developed were used to produce a software reference architecture for an Any-Time/Any-Place group memory system. The reference architecture that resulted from the consolidated requirements was validated through comparison with representative implementations documented in the published literature of cooperative work. Functional mappings from the reference architecture to the validation instantiations have shown that the reference architecture represents the class of systems that support group memory.

APA, Harvard, Vancouver, ISO, and other styles

40

Elver, Marco Iskender. "Memory consistency directed cache coherence protocols for scalable multiprocessors." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/22073.

Full text

Abstract:

The memory consistency model, which formally specifies the behavior of the memory system, is used by programmers to reason about parallel programs. From a hardware design perspective, weaker consistency models permit various optimizations in a multiprocessor system: this thesis focuses on designing and optimizing the cache coherence protocol for a given target memory consistency model. Traditional directory coherence protocols are designed to be compatible with the strictest memory consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that provide more relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of scalability, in terms of per-core storage due to sharer tracking, which poses a problem with increasing number of cores in today’s CMPs, most of which no longer are sequentially consistent. The recent convergence towards programming language based relaxed memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86) cannot readily make use of lazy coherence protocols without either: adapting the protocol to satisfy the stricter consistency model; or changing the architecture’s consistency model to (a variant of) RC, typically at the expense of backward compatibility. The first part of this thesis explores both these options, with a focus on a practical approach satisfying backward compatibility. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers and instead relies on self-invalidation and detection of potential acquires (in the absence of explicit synchronization) using per cache line timestamps to efficiently and lazily satisfy the TSO memory consistency model. Our results show that TSO-CC achieves, on average, performance comparable to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales logarithmically with increasing core count. Next, we propose an approach for the x86-64 architecture, which is a compromise between retaining the original consistency model and using a more storage efficient lazy coherence protocol. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backward compatibility with legacy codes and older microarchitectures. Second, we propose RC3 (based on TSOCC), a scalable cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3’s storage overhead per cache line scales logarithmically with increasing core count and reduces on-chip coherence storage overheads by 45% compared to TSO-CC. Finally, it is imperative that hardware adheres to the promised memory consistency model. Indeed, consistency directed coherence protocols cannot use conventional coherence definitions (e.g. SWMR) to be verified against, and few existing verification methodologies apply. Furthermore, as the full consistency model is used as a specification, their interaction with other components (e.g. pipeline) of a system must not be neglected in the verification process. Therefore, verifying a system with such protocols in the context of interacting components is even more important than before. One common way to do this is via executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the consistency model. It would be extremely beneficial to execute such tests under simulation, i.e. when the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which when used for simulation-based memory consistency verification would be too slow. We propose McVerSi, a test generation framework for fast memory consistency verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to memory consistency test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering memory consistency bugs. To guide tests towards exercising as much logic as possible, the simulator’s reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the Gem5 cycle accurate simulator in full-system mode with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI protocol due to the faulty interaction of the pipeline and the cache coherence protocol, highlighting that even conventional protocols should be verified rigorously in the context of a full-system. Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, therefore providing much higher guarantees compared to alternative approaches (pseudo-random test generation and litmus tests).

APA, Harvard, Vancouver, ISO, and other styles

41

Doudalis, Ioannis. "Hardware assisted memory checkpointing and applications in debugging and reliability." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/42700.

Full text

Abstract:

The problems of software debugging and system reliability/availability are among the most challenging problems the computing industry is facing today, with direct impact on the development and operating costs of computing systems. A promising debugging technique that assists programmers identify and fix the causes of software bugs a lot more efficiently is bidirectional debugging, which enables the user to execute the program in "reverse", and a typical method used to recover a system after a fault is backwards error recovery, which restores the system to the last error-free state. Both reverse execution and backwards error recovery are enabled by creating memory checkpoints, which are used to restore the program/system to a prior point in time and re-execute until the point of interest. The checkpointing frequency is the primary factor that affects both the latency of reverse execution and the recovery time of the system; more frequent checkpoints reduce the necessary re-execution time. Frequent creation of checkpoints poses performance challenges, because of the increased number of memory reads and writes necessary for copying the modified system/program memory, and also because of software interventions, additional synchronization and I/O, etc., needed for creating a checkpoint. In this thesis I examine a number of different hardware accelerators, whose role is to create frequent memory checkpoints in the background, at minimal performance overheads. For the purpose of reverse execution, I propose the HARE and Euripus hardware checkpoint accelerators. HARE and Euripus create different types of checkpoints, and employ different methods for keeping track of the modified memory. As a result, HARE and Euripus have different hardware costs and provide different functionality which directly affects the latency of reverse execution. For improving the availability of the system, I propose the Kyma hardware accelerator. Kyma enables simultaneous creation of checkpoints at different frequencies, which allows the system to recover from multiple types of errors and tolerate variable error-detection latencies. The Kyma and Euripus hardware engines have similar architectures, but the functionality of the Kyma engine is optimized for further reducing the performance overheads and improving the reliability of the system. The functionality of the Kyma and Euripus engines can be combined into a unified accelerator that can serve the needs of both bidirectional debugging and system recovery.

APA, Harvard, Vancouver, ISO, and other styles

42

Yan, Chenyu. "Architectural support for improving security and performance of memory sub-systems." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26663.

Full text

Abstract:

Thesis (Ph.D)--Computing, Georgia Institute of Technology, 2009.
Committee Chair: Milos Prvulovic; Committee Member: Gabriel Loh; Committee Member: Hyesoon Kim; Committee Member: Umakishore Ramachandran; Committee Member: Yan Solihin. Part of the SMARTech Electronic Thesis and Dissertation Collection.

APA, Harvard, Vancouver, ISO, and other styles

43

Lodde, Mario. "Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35325.

Full text

Abstract:

La jerarquía de caches y la red en el chip (NoC) son dos componentes clave de los chip multiprocesadores (CMPs). La mayoría del trafico en la NoC se debe a mensajes que las caches envían según lo que establece el protocolo de coherencia. La cantidad de trafico, el porcentaje de mensajes cortos y largos y el patrón de trafico en general varían dependiendo de la geometría de las caches y del protocolo de coherencia. La arquitectura de la NoC y la jerarquía de caches están de hecho firmemente acopladas, y estos dos componentes deben ser diseñados y evaluados conjuntamente para estudiar como el variar uno afecta a las prestaciones del otro. Además, cada componente debe ajustarse a los requisitos y a las oportunidades del otro, y al revés. Normalmente diferentes clases de mensajes se envían por diferentes redes virtuales o por NoCs con diferente ancho de banda, separando mensajes largos y cortos. Sin embargo, otra clasificación de los mensajes se puede hacer dependiendo del tipo de información que proveen: algunos mensajes, como las peticiones de datos, necesitan campos para almacenar información (dirección del bloque, tipo de petición, etc.); otros, como los mensajes de reconocimiento (ACK), no proporcionan ninguna información excepto por el ID del nodo destino: solo proveen una información de tipo temporal, en el sentido que la recepción de un ACK indica que el nodo fuente ha recibido el mensaje al que está contestando con el ACK y completado todas las operaciones determinadas por el protocolo de coherencia. Esta segunda clase de mensaje no necesita de mucho ancho de banda: la latencia es mucho mas importante, dado que el nodo destino esta típicamente bloqueado esperando la recepción de ellos. En este trabajo de tesis se desarrolla una red dedicada para trasmitir la segunda clase de mensajes; la red es muy sencilla y rápida, y permite la entrega de los ACKs con una latencia de pocos ciclos de reloj. Reduciendo la latencia y el trafico en la NoC debido a los ACKs, es posible: -acelerar la fase de invalidación en fase de escritura en un sistema que usa un protocolo de coherencia basado en directorios -mejorar las prestaciones de un protocolo de coerencia basado en broadcast, hasta llegar a prestaciones comparables con las de un protocolo de directorios pero sin el coste de área debido a la necesidad de almacenar el directorio -implementar un mapeado dinámico de bloques a las caches de ultimo nivel de forma eficiente, con el objetivo de acercar cuanto al máximo los bloques a los cores que los utilizan El objetivo final es obtener un co-diseño de NoC y jerarquía de caches que minimice los problemas de escalabilidad de los protocolos de coherencia. Como gran objetivo final, se pretende la implementación de un CMP con ubicación dinámica de los recursos de cache y red, tal que estos recursos se puedan particionar de forma eficiente e independiente para asignar diferentes particiones a diferentes aplicaciones en un entorno virtualizado.
Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
TESIS

APA, Harvard, Vancouver, ISO, and other styles

44

Shelor, Charles F. "Dataflow Processing in Memory Achieves Significant Energy Efficiency." Thesis, University of North Texas, 2018. https://digital.library.unt.edu/ark:/67531/metadc1248478/.

Full text

Abstract:

The large difference between processor CPU cycle time and memory access time, often referred to as the memory wall, severely limits the performance of streaming applications. Some data centers have shown servers being idle three out of four clocks. High performance instruction sequenced systems are not energy efficient. The execute stage of even simple pipeline processors only use 9% of the pipeline's total energy. A hybrid dataflow system within a memory module is shown to have 7.2 times the performance with 368 times better energy efficiency than an Intel Xeon server processor on the analyzed benchmarks. The dataflow implementation exploits the inherent parallelism and pipelining of the application to improve performance without the overhead functions of caching, instruction fetch, instruction decode, instruction scheduling, reorder buffers, and speculative execution used by high performance out-of-order processors. Coarse grain reconfigurable logic in an energy efficient silicon process provides flexibility to implement multiple algorithms in a low energy solution. Integrating the logic within a 3D stacked memory module provides lower latency and higher bandwidth access to memory while operating independently from the host system processor.

APA, Harvard, Vancouver, ISO, and other styles

45

Radovic, Zoran. "Software Techniques for Distributed Shared Memory." Doctoral thesis, Uppsala University, Department of Information Technology, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-6058.

Full text

Abstract:

In large multiprocessors, the access to shared memory is often nonuniform, and may vary as much as ten times for some distributed shared-memory architectures (DSMs). This dissertation identifies another important nonuniform property of DSM systems: nonuniform communication architecture, NUCA. High-end hardware-coherent machines built from large nodes, or from chip multiprocessors, are typical NUCA systems, since they have a lower penalty for reading recently written data from a neighbor's cache than from a remote cache. This dissertation identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations exploiting NUCAs are presented and evaluated. NUCA-aware locks are shown to be almost twice as efficient for contended critical sections compared to traditional lock implementations.

The shared-memory “illusion”' provided by some large DSM systems may be implemented using either hardware, software or a combination thereof. A software-based implementation can enable cheap cluster hardware to be used, but typically suffers from poor and unpredictable performance characteristics.

This dissertation advocates a new software-hardware trade-off design point based on a new combination of techniques. The two low-level techniques, fine-grain deterministic coherence and synchronous protocol execution, as well as profile-guided protocol flexibility, are evaluated in isolation as well as in a combined setting using all-software implementations. Finally, a minimum of hardware trap support is suggested to further improve the performance of coherence protocols across cluster nodes. It is shown that all these techniques combined could result in a fairly stable performance on par with hardware-based coherence.

APA, Harvard, Vancouver, ISO, and other styles

46

Sparks, Matthew A. "A COMPREHENSIVE HDL MODEL OF A LINE ASSOCIATIVE REGISTER BASED ARCHITECTURE." UKnowledge, 2013. http://uknowledge.uky.edu/ece_etds/26.

Full text

Abstract:

Modern processor architectures suffer from an ever increasing gap between processor and memory performance. The current memory-register model attempts to hide this gap by a system of cache memory. Line Associative Registers(LARs) are proposed as a new system to avoid the memory gap by pre-fetching and associative updating of both instructions and data. This thesis presents a fully LAR-based architecture, targeting a previously developed instruction set architecture. This architecture features an execution pipeline supporting SWAR operations, and a memory system supporting the associative behavior of LARs and lazy writeback to memory.

APA, Harvard, Vancouver, ISO, and other styles

47

Holsapple, Stephen Alan. "DSM64: A DISTRIBUTED SHARED MEMORY SYSTEM IN USER-SPACE." DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/725.

Full text

Abstract:

This paper presents DSM64: a lazy release consistent software distributed shared memory (SDSM) system built entirely in user-space. The DSM64 system is capable of executing threaded applications implemented with pthreads on a cluster of networked machines without any modifications to the target application. The DSM64 system features a centralized memory manager [1] built atop Hoard [2, 3]: a fast, scalable, and memory-efficient allocator for shared-memory multiprocessors. In my presentation, I present a SDSM system written in C++ for Linux operating systems. I discuss a straight-forward approach to implement SDSM systems in a Linux environment using system-provided tools and concepts avail- able entirely in user-space. I show that the SDSM system presented in this paper is capable of resolving page faults over a local area network in as little as 2 milliseconds. In my analysis, I present the following. I compare the performance characteristics of a matrix multiplication benchmark using various memory coherency models. I demonstrate that matrix multiplication benchmark using a LRC model performs orders of magnitude quicker than the same application using a stricter coherency model. I show the effect of coherency model on memory access patterns and memory contention. I compare the effects of different locking strategies on execution speed and memory access patterns. Lastly, I provide a comparison of the DSM64 system to a non-networked version using a system-provided allocator.

APA, Harvard, Vancouver, ISO, and other styles

48

Chen, Zhi. "Power-Efficient and Low-Latency Memory Access for CMP Systems with Heterogeneous Scratchpad On-Chip Memory." UKnowledge, 2013. http://uknowledge.uky.edu/ece_etds/25.

Full text

Abstract:

The gradually widening speed disparity of between CPU and memory has become an overwhelming bottleneck for the development of Chip Multiprocessor (CMP) systems. In addition, increasing penalties caused by frequent on-chip memory accesses have raised critical challenges in delivering high memory access performance with tight power and latency budgets. To overcome the daunting memory wall and energy wall issues, this thesis focuses on proposing a new heterogeneous scratchpad memory architecture which is configured from SRAM, MRAM, and Z-RAM. Based on this architecture, we propose two algorithms, a dynamic programming and a genetic algorithm, to perform data allocation to different memory units, therefore reducing memory access cost in terms of power consumption and latency. Extensive and intensive experiments are performed to show the merits of the heterogeneous scratchpad architecture over the traditional pure memory system and the effectiveness of the proposed algorithms.

APA, Harvard, Vancouver, ISO, and other styles

49

Giordano, Omar. "Design and Implementation of an Architecture-aware In-memory Key- Value Store." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291213.

Full text

Abstract:

Key-Value Stores (KVSs) are a type of non-relational databases whose data is represented as a key-value pair and are often used to represent cache and session data storage. Among them, Memcached is one of the most popular ones, as it is widely used in various Internet services such as social networks and streaming platforms. Given the continuous and increasingly rapid growth of networked devices that use these services, the commodity hardware on which the databases are based must process packets faster to meet the needs of the market. However, in recent years, the performance improvements characterising the new hardware has become thinner and thinner. From here, as the purchase of new products is no longer synonymous with significant performance improvements, companies need to exploit the full potential of the hardware already in their possession, consequently postponing the purchase of more recent hardware. One of the latest ideas for increasing the performance of commodity hardware is the use of slice-aware memory management. This technique exploits the Last Level of Cache (LLC) by making sure that the individual cores take data from memory locations that are mapped to their respective cache portions (i.e., LLC slices). This thesis focuses on the realisation of a KVS prototype—based on Intel Haswell micro-architecture—built on top of the Data Plane Development Kit (DPDK), and to which the principles of slice-aware memory management are applied. To test its performance, given the non-existence of a DPDKbased traffic generator that supports the Memcached protocol, an additional prototype of a traffic generator that supports these features has also been developed. The performances were measured using two distinct machines: one for the traffic generator and one for the KVS. First, the “regular” KVS prototype was tested, then, to see the actual benefits, the slice-aware one. Both KVS prototypeswere subjected to two types of traffic: (i) uniformtraffic where the keys are always different from each other, and (ii) skewed traffic, where keys are repeated and some keys are more likely to be repeated than others. The experiments show that, in real-world scenario (i.e., characterised by skewed key distributions), the employment of a slice-aware memory management technique in a KVS can slightly improve the end-to-end latency (i.e.,~2%). Additionally, such technique highly impacts the look-up time required by the CPU to find the key and the corresponding value in the database, decreasing the mean time by ~22.5%, and improving the 99th percentile by ~62.7%.
Key-Value Stores (KVSs) är en typ av icke-relationsdatabaser vars data representeras som ett nyckel-värdepar och används ofta för att representera lagring av cache och session. Bland dem är Memcached en av de mest populära, eftersom den används ofta i olika internettjänster som sociala nätverk och strömmande plattformar. Med tanke på den kontinuerliga och allt snabbare tillväxten av nätverksenheter som använder dessa tjänster måste den råvaruhårdvara som databaserna bygger på bearbeta paket snabbare för att möta marknadens behov. Under de senaste åren har dock prestandaförbättringarna som kännetecknar den nya hårdvaran blivit tunnare och tunnare. Härifrån, eftersom inköp av nya produkter inte längre är synonymt med betydande prestandaförbättringar, måste företagen utnyttja den fulla potentialen för hårdvaran som redan finns i deras besittning, vilket skjuter upp köpet av nyare hårdvara. En av de senaste idéerna för att öka prestanda för råvaruhårdvara är användningen av skivmedveten minneshantering. Denna teknik utnyttjar den Sista Nivån av Cache (SNC) genom att se till att de enskilda kärnorna tar data från minnesplatser som är mappade till deras respektive cachepartier (dvs. SNCskivor). Denna avhandling fokuserar på förverkligandet av en KVS-prototyp— baserad på Intel Haswell mikroarkitektur—byggd ovanpå Data Plane Development Kit (DPDK), och på vilken principerna för skivmedveten minneshantering tillämpas. För att testa dess prestanda, med tanke på att det inte finns en DPDK-baserad trafikgenerator som stöder Memcachedprotokollet, har en ytterligare prototyp av en trafikgenerator som stöder dessa funktioner också utvecklats. Föreställningarna mättes med två olika maskiner: en för trafikgeneratorn och en för KVS. Först testades den “vanliga” KVSprototypen, för att se de faktiska fördelarna, den skivmedvetna. Båda KVSprototyperna utsattes för två typer av trafik: (i) enhetlig trafik där nycklarna alltid skiljer sig från varandra och (ii) sned trafik, där nycklar upprepas och vissa nycklar är mer benägna att upprepas än andra. Experimenten visar att i verkliga scenarier (dvs. kännetecknas av snedställda nyckelfördelningar) kan användningen av en skivmedveten minneshanteringsteknik i en KVS förbättra förbättringen från slut till slut (dvs. ~2%). Dessutom påverkar sådan teknik i hög grad uppslagstiden som krävs av CPU: n för att hitta nyckeln och motsvarande värde i databasen, vilket minskar medeltiden med ~22, 5% och förbättrar 99th percentilen med ~62, 7%.

APA, Harvard, Vancouver, ISO, and other styles

50

Young, Jeffrey. "Dynamic partitioned global address spaces for high-efficiency computing." Thesis, Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26467.

Full text

Abstract:

Thesis (M. S.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Yalamanchili, Sudhakar; Committee Member: Riley, George; Committee Member: Schimmel, David. Part of the SMARTech Electronic Thesis and Dissertation Collection.

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'Computer Memory Architecture'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles