To see the other types of publications on this topic, follow the link: Computer Memory Architecture.

Journal articles on the topic 'Computer Memory Architecture'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Computer Memory Architecture.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Choi, Yongseok, Eunji Lim, Jaekwon Shin, and Cheol-Hoon Lee. "MemBox: Shared Memory Device for Memory-Centric Computing Applicable to Deep Learning Problems." Electronics 10, no. 21 (November 8, 2021): 2720. http://dx.doi.org/10.3390/electronics10212720.

Abstract:
Large-scale computational problems that modern computers must address, such as deep learning or big data analysis, cannot be solved on a single computer, but they can be solved with distributed computer systems. Since most distributed computing systems, consisting of a large number of networked computers, must propagate their computational results to one another, they suffer from growing communication overhead, which lowers computational efficiency. To address this problem, we proposed a distributed system architecture built around a shared memory that is simultaneously accessible by multiple computers. The architecture is intended for implementation in an FPGA or ASIC. Using an FPGA board that implemented the architecture, we configured an actual distributed system and demonstrated its feasibility. We compared the results of a deep learning application test using our architecture with those obtained using Google TensorFlow's parameter server mechanism, showed improvements over the parameter server mechanism, and determined the future direction of research by deriving the expected problems.
2

Pancratov, Cosmin, Jacob M. Kurzer, Kelly A. Shaw, and Matthew L. Trawick. "Why Computer Architecture Matters: Memory Access." Computing in Science & Engineering 10, no. 4 (July 2008): 71–75. http://dx.doi.org/10.1109/mcse.2008.106.

3

Əzizxan oğlu Eyyubov, Ramazan, Leyla Elxan qızı Bayramova, and Zeynəb Mirsəməd qızı Sadıqova. "Computer architecture and John von Neumann principles." SCIENTIFIC WORK 15, no. 2 (March 9, 2021): 11–15. http://dx.doi.org/10.36719/2663-4619/63/11-15.

Abstract:
The program is loaded into the machine's memory from an external device. The control unit organizes its execution according to the program stored in memory. The arithmetic-logic unit performs mathematical and logical calculations in accordance with the entered commands. Thus, the computer performs calculations without human assistance. Key words: computer, software, device, information, scheme
4

Yantır, Hasan Erdem, Ahmed M. Eltawil, and Khaled N. Salama. "Efficient Acceleration of Stencil Applications through In-Memory Computing." Micromachines 11, no. 6 (June 26, 2020): 622. http://dx.doi.org/10.3390/mi11060622.

Abstract:
Traditional computer architectures suffer severely from the bottleneck between processing elements and memory, which is the biggest barrier to their scalability. Meanwhile, the amount of data that applications need to process is increasing rapidly, especially in the era of big data and artificial intelligence. This forces new constraints on computer architecture design towards more data-centric principles. Therefore, new paradigms such as in-memory and near-memory processors have begun to emerge to counteract the memory bottleneck by bringing memory closer to computation or integrating the two. Associative processors, which combine the processor and memory in the same location, are a promising candidate for in-memory computation that alleviates the memory bottleneck. One class of applications that needs iterative processing of huge amounts of data is stencil codes; given this characteristic, associative processors can provide a paramount advantage for them. For demonstration, two in-memory associative processor architectures for 2D stencil codes are proposed, implemented in both emerging memristor and traditional SRAM technologies. The proposed architecture achieves promising efficiency for a variety of stencil applications and thus proves its applicability to scientific stencil computing.
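The memory-bound character of stencil codes is visible even in a scalar C sketch of a generic five-point Jacobi update (an illustrative kernel, not the paper's associative-processor formulation; the function name and averaging coefficient are assumed):

```c
/* One Jacobi sweep of a 2-D five-point stencil over the interior of an
 * n x n grid. Each output point re-reads four neighbours from memory,
 * so data movement rather than arithmetic dominates; this is the
 * property that makes stencils a natural target for in-memory
 * computation. */
void stencil5(int n, const float *in, float *out)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            out[i * n + j] = 0.25f * (in[(i - 1) * n + j] +
                                      in[(i + 1) * n + j] +
                                      in[i * n + j - 1] +
                                      in[i * n + j + 1]);
}
```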
5

Waterson, Clare, and B. Keith Jenkins. "Shared-memory optical/electronic computer: architecture and control." Applied Optics 33, no. 8 (March 10, 1994): 1559. http://dx.doi.org/10.1364/ao.33.001559.

6

AKL, SELIM G. "THREE COUNTEREXAMPLES TO DISPEL THE MYTH OF THE UNIVERSAL COMPUTER." Parallel Processing Letters 16, no. 03 (September 2006): 381–403. http://dx.doi.org/10.1142/s012962640600271x.

Abstract:
It is shown that the concept of a Universal Computer cannot be realized. Specifically, instances of a computable function are exhibited that cannot be computed on any machine that is capable of only a finite and fixed number of operations per step. This remains true even if the machine is endowed with an infinite memory and the ability to communicate with the outside world while it is attempting to compute the function. It also remains true if, in addition, the machine is given an indefinite amount of time for the computation. This result applies not only to idealized models of computation, such as the Turing Machine and the like, but also to all known general-purpose computers, including existing conventional computers (both sequential and parallel), as well as contemplated unconventional ones such as biological and quantum computers. Even accelerating machines (that is, machines that increase their speed at every step) cannot be universal.
7

MILES, COE F., and DAVID ROGERS. "A BIOLOGICALLY MOTIVATED ASSOCIATIVE MEMORY ARCHITECTURE." International Journal of Neural Systems 04, no. 02 (June 1993): 109–27. http://dx.doi.org/10.1142/s0129065793000110.

Abstract:
A synthesis of analytical techniques from the fields of biology, mathematics, computer science and engineering is used to model the information processing characteristics of the mammalian cerebellar cortex. By viewing anatomically different neurons as network elements whose input-output functions differ, a mechanism for distributing information throughout the memory is proposed. The functional circuitry developed to implement this feature is called the microcircuit. Overlapping microcircuit activity is used to describe the memory's read and write operations. Key features of the memory model include: (1) its use of a sparse interconnection network, (2) its ability to manipulate very large input patterns, (3) its distributed storage of input data patterns and (4) its statistical reconstruction of stored patterns during memory read operations. Quantitative measures for the memory's recall fidelity and storage capacity are derived, and results of computer simulations are presented.
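The four listed features also characterize Kanerva-style sparse distributed memories; purely as an illustration of that general family (not the authors' microcircuit model; all sizes and names here are assumptions), a compact C sketch of distributed write and statistical read:

```c
#include <stdint.h>
#include <stdlib.h>

#define LOCS   500  /* hard storage locations             */
#define DIM    64   /* pattern length in bits             */
#define RADIUS 24   /* activation radius (Hamming metric) */

static uint64_t addr[LOCS];      /* fixed random location addresses */
static int counters[LOCS][DIM];  /* distributed storage             */

static int hamming(uint64_t a, uint64_t b)
{
    return __builtin_popcountll(a ^ b);  /* GCC/Clang builtin */
}

void sdm_init(void)  /* assign each location a random address */
{
    for (int l = 0; l < LOCS; l++)
        addr[l] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
}

/* Write: every location within RADIUS of the write address absorbs the
 * pattern, so a single item is spread across many locations. */
void sdm_write(uint64_t a, uint64_t data)
{
    for (int l = 0; l < LOCS; l++)
        if (hamming(addr[l], a) <= RADIUS)
            for (int b = 0; b < DIM; b++)
                counters[l][b] += (data >> b & 1) ? 1 : -1;
}

/* Read: sum the counters of all activated locations and threshold,
 * statistically reconstructing the stored pattern. */
uint64_t sdm_read(uint64_t a)
{
    int sum[DIM] = {0};
    uint64_t out = 0;
    for (int l = 0; l < LOCS; l++)
        if (hamming(addr[l], a) <= RADIUS)
            for (int b = 0; b < DIM; b++)
                sum[b] += counters[l][b];
    for (int b = 0; b < DIM; b++)
        if (sum[b] > 0)
            out |= (uint64_t)1 << b;
    return out;
}
```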
8

Jacobson, Peter, Bo Kågström, and Mikael Rännar. "Algorithm Development for Distributed Memory Multicomputers Using CONLAB." Scientific Programming 1, no. 2 (1992): 185–203. http://dx.doi.org/10.1155/1992/365325.

Abstract:
CONLAB (CONcurrent LABoratory) is an environment for developing algorithms for parallel computer architectures and for simulating different parallel architectures. A user can experimentally verify and obtain a picture of the real performance of a parallel algorithm executing on a simulated target architecture. CONLAB gives high-level support for expressing computations and communications in a distributed memory multicomputer (DMM) environment. A development methodology for DMM algorithms that is based on different levels of abstraction of the problem, the target architecture, and the CONLAB language itself is presented and illustrated with two examples. Simulation results for, and real experiments on, the Intel iPSC/2 hypercube are presented. Because CONLAB is developed to run on uniprocessor UNIX workstations, it is an educational tool that offers interactive (simulated) parallel computing to a wide audience.
9

Jan, Yahya, and Lech Jóźwiak. "Communication and Memory Architecture Design of Application-Specific High-End Multiprocessors." VLSI Design 2012 (March 25, 2012): 1–20. http://dx.doi.org/10.1155/2012/794753.

Abstract:
This paper is devoted to the design of the communication and memory architectures of massively parallel hardware multiprocessors necessary for the implementation of highly demanding applications. We demonstrated that for massively parallel hardware multiprocessors the traditionally used flat communication architectures and multi-port memories do not scale well, and the influence of the memory and communication network on both throughput and circuit area dominates that of the processors. To resolve these problems and ensure scalability, we proposed to design highly optimized application-specific hierarchical and/or partitioned communication and memory architectures by exploring and exploiting the regularity and hierarchy of the actual data flows of a given application. Furthermore, we proposed data distribution and related data mapping schemes for the shared (global) partitioned memories with the aim of eliminating memory access conflicts and of ensuring that our communication design strategies remain applicable. We incorporated these architecture synthesis strategies into our quality-driven model-based multiprocessor design method and the related automated architecture exploration framework. Using this framework, we performed a large series of experiments that demonstrate many important features of the synthesized memory and communication architectures. They also demonstrate that our method and framework are able to efficiently synthesize well-scalable memory and communication architectures even for high-end multiprocessors. Gains as high as 12x in performance and 25x in area can be obtained when using hierarchical communication networks instead of flat networks. However, for high parallelism levels only the partitioned approach ensures scalability in performance.
10

Rez, Peter, and D. J. Fathers. "Computer system architecture for image and spectral processing." Proceedings, annual meeting, Electron Microscopy Society of America 45 (August 1987): 92–95. http://dx.doi.org/10.1017/s0424820100125415.

Abstract:
In this paper we shall discuss digital imaging and spectroscopy systems from the perspective of a system designer and we shall concentrate on those design choices that limit performance in microscopy and analysis applications. The hardware of a computer system can be broken down into three main components. These are the processor which performs arithmetic and logical operations, the memory for storing data and instructions and the peripherals for long term data storage (disks, tapes) and communication with the outside world. Linking these components is a data highway or bus for passing digital information from one section of the machine to another. A good definition of a bus is a set of interconnections with a defined procedure (protocol) for information transmission. In many small systems the bus is not only a set of electrical connections but is also an enclosure (a backplane) into which the different modules (processor, memory, peripheral controllers) are added.
11

Kim, Bo-Sung, and Jun-Dong Cho. "Maximizing Memory Data Reuse for Lower Power Motion Estimation." VLSI Design 14, no. 3 (January 1, 2002): 299–305. http://dx.doi.org/10.1080/10655140290011096.

Abstract:
This paper presents a new VLSI architecture for motion estimation in MPEG-2. Previously, a number of full-search block matching algorithms (BMAs) and architectures using systolic arrays have been proposed for motion estimation. However, those architectures require an inefficiently large number of external memory accesses. Recently, to reduce the number of accesses per search block, a block matching method that reuses search data within a search area using systolic process arrays was proposed. To further reduce data accesses and computation time during block matching, we propose a new approach that reuses previously searched data in two dimensions. The architecture in this paper extends our previous work by reusing the previously searched area not only between two consecutive columns but also between two consecutive rows, so as to entirely remove redundant memory accesses. Experimental results show that our architecture, at the cost of an 81% area increase, eliminates 98% of memory accesses. The total power reduction is 86% in power estimation with a SPICE model.
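For orientation, the kernel whose memory traffic is being optimized is full-search block matching; a minimal scalar C version (illustrative only: 16x16 blocks are assumed and frame-boundary checks are omitted):

```c
#include <stdlib.h>
#include <limits.h>

/* Sum of absolute differences between a 16x16 current block and one
 * candidate block in the reference frame. */
static unsigned sad16(const unsigned char *cur, const unsigned char *ref,
                      int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Full search: evaluate every displacement in a +/-range window and
 * keep the minimum-SAD candidate. Neighbouring candidates re-read
 * heavily overlapping reference pixels; that redundancy is exactly
 * what the paper's two-dimensional data reuse removes. */
void full_search(const unsigned char *cur, const unsigned char *ref,
                 int stride, int range, int *best_dx, int *best_dy)
{
    unsigned best = UINT_MAX;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            unsigned sad = sad16(cur, ref + dy * stride + dx, stride);
            if (sad < best) {
                best = sad;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
}
```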
12

Chen, Yuanyuan, Jing Chen, and Mingzhu Li. "Numerical Modeling of A New Virtual Trajectory Password Architecture." Journal of Physics: Conference Series 2068, no. 1 (October 1, 2021): 012013. http://dx.doi.org/10.1088/1742-6596/2068/1/012013.

Abstract:
With the development of digital technology, computer technology, communication technology and multimedia technology have gradually permeated one another and become the core of information technology. In information technology, digital, text, graphic, image, sound, video and animation carriers are spread through computers and the Internet. Because of the openness, sharing and dynamic nature of the Internet, information security is threatened and interfered with. Information security has become a strategic issue that people must pay attention to, one related to social stability, economic development and national security. This paper studies a new kind of virtual trajectory password, which breaks through the traditional character-based memorization method: using mathematical modeling, it replaces the memorization of numbers and characters with the memorization of trajectory graphics, so as to make memorizing passwords easier.
13

Ahmad, Othman. "FPGA BASED INDIVIDUAL COMPUTER ARCHITECTURE LABORATORY EXERCISES." Journal of BIMP-EAGA Regional Development 3, no. 1 (December 15, 2017): 23–31. http://dx.doi.org/10.51200/jbimpeagard.v3i1.1026.

Abstract:
Computer Architecture is the study of digital computers with a view to designing, building and operating them. Digital computers are vital to modern living because they provide the intelligence in devices such as self-driving cars and smartphones. Computer Architecture is a core subject of the Electronic (Computer) Engineering course at the Universiti Malaysia Sabah, which complies with the requirements of the Washington Accord as accredited by the Engineering Accreditation Council of the Board of Engineers Malaysia (EAC). An FPGA (Field Programmable Gate Array) based Computer Architecture Laboratory has been developed to support the curriculum of this course. FPGAs allow a sustainable implementation of laboratory exercises without resorting to the poisonous fabrication of microelectronic devices or the installation of integrated circuits: an FPGA is a configurable, and therefore reusable, digital design component. Two established organisations promoting the computer engineering curriculum, ACM and IEEE, encourage the use of FPGAs in digital design in their latest recommendations and, together with the EAC, emphasise each student's grasp of the fundamentals. The laboratory exercises are individual exercises in which each student is given a unique assignment. A laboratory manual is provided as a guide and project specification for each student, but overall the laboratory is student-centred: each student is allowed to pace their own effort through sessions one to ten. A quantitative analysis of the effectiveness of these laboratory sessions is carried out based on the number of students completing them. The sessions run from an FPGA tutorial (session 1) to implementations of microprocessor features: immediate load (2), immediate load to multiple registers (3), addition (4), operation code (5), program memory (6), jump (7), conditional jump (8), register to register (9) and input-output (10). The results for three batches of students show that, within the time limits of a one-credit-hour course, students managed to complete some aspects of the implementation of a simple microprocessor.
14

Miles, C. F., and C. D. Rogers. "The microcircuit associative memory: a biologically motivated memory architecture." IEEE Transactions on Neural Networks 5, no. 3 (May 1994): 424–35. http://dx.doi.org/10.1109/72.286913.

15

Tausif, Mohd, Ekram Khan, Mohd Hasan, and Martin Reisslein. "Lifting-Based Fractional Wavelet Filter: Energy-Efficient DWT Architecture for Low-Cost Wearable Sensors." Advances in Multimedia 2020 (December 16, 2020): 1–13. http://dx.doi.org/10.1155/2020/8823689.

Abstract:
This paper proposes and evaluates the LFrWF, a novel lifting-based architecture to compute the discrete wavelet transform (DWT) of images using the fractional wavelet filter (FrWF). In order to reduce the memory requirement of the proposed architecture, only one image line is read into a buffer at a time. Aside from an LFrWF version with multipliers, the LFrWFm, we develop a multiplier-less version, the LFrWFml, which reduces the critical path delay (CPD) to the delay Ta of an adder. The proposed LFrWFm and LFrWFml architectures are compared in terms of the required adders, multipliers, memory, and critical path delay with state-of-the-art DWT architectures. Moreover, the proposed LFrWFm and LFrWFml architectures, along with the state-of-the-art FrWF architectures with multipliers (FrWFm) and without multipliers (FrWFml), are compared through implementation on the same FPGA board. The LFrWFm requires 22% fewer look-up tables (LUTs), 34% fewer flip-flops (FFs), and 50% fewer compute cycles (CCs), and consumes 65% less energy than the FrWFm. Also, the proposed LFrWFml architecture requires 50% fewer CCs and consumes 43% less energy than the FrWFml. Thus, the proposed LFrWFm and LFrWFml architectures appear suitable for computing the DWT of images on wearable sensors.
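For readers new to lifting, the idea is to factor a wavelet filter into short predict/update steps computed in place; a toy C example for the Haar case (the simplest lifting factorization, not the specific filter used in the paper):

```c
/* One level of the 1-D Haar transform in lifting form: a predict step
 * (detail = odd - even) followed by an update step (smooth = even +
 * detail/2). Everything is computed in place, which is why lifting
 * needs so little buffer memory: a single image line suffices. */
void haar_lifting_1d(int *x, int n)  /* n must be even */
{
    for (int i = 0; i < n; i += 2) {
        int d = x[i + 1] - x[i];  /* predict: detail coefficient */
        int s = x[i] + d / 2;     /* update: smooth coefficient  */
        x[i] = s;
        x[i + 1] = d;
    }
}
```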
16

Chi, Ye, Haikun Liu, Ganwei Peng, Xiaofei Liao, and Hai Jin. "Transformer: An OS-Supported Reconfigurable Hybrid Memory Architecture." Applied Sciences 12, no. 24 (December 18, 2022): 12995. http://dx.doi.org/10.3390/app122412995.

Abstract:
Non-volatile memories (NVMs) have aroused vast interest for hybrid memory systems due to their promising features of byte-addressability, high storage density, low cost per byte, and near-zero standby energy consumption. However, since NVMs have limited write endurance, high write latency, and high write energy consumption, it is still challenging to directly replace traditional dynamic random access memory (DRAM) with NVMs. Many studies propose to utilize NVM and DRAM in a hybrid memory system and explore sophisticated memory management schemes to alleviate the impact of slow NVM on application performance. A few studies architected DRAM and NVM in a cache/memory hierarchy; however, the storage and performance overhead of cache metadata (i.e., tag) management is rather expensive in this hierarchical architecture. Other studies architected NVM and DRAM in a single (flat) address space to form a parallel architecture; here, hot-page monitoring and migration are critical for application performance. In this paper, we propose Transformer, an OS-supported reconfigurable hybrid memory architecture that uses DRAM and NVM efficiently without redesigning the hardware. To identify frequently accessed (hot) memory pages for migration, we count page accesses in the OS by periodically sampling the access bits of pages, and we migrate the identified hot pages from NVM to DRAM to improve the performance of the hybrid memory system. More importantly, Transformer can simulate a hierarchical hybrid memory architecture while DRAM and NVM are physically managed in a flat address space, and it can dynamically shift the logical memory architecture between parallel and hierarchical according to applications' memory access patterns. Experimental results show that Transformer can improve application performance by 62% on average (up to 2.7x) compared with an NVM-only system, and by up to 79% and 42% (21% and 24% on average) compared with hierarchical and parallel architectures, respectively.
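A schematic C rendering of the access-bit sampling loop described above (a self-contained toy: the page state that actually lives in page tables and the kernel's migration machinery are modeled as plain arrays, and the hotness threshold is an assumption):

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES        4096
#define HOT_THRESHOLD 8

static bool     access_bit[NPAGES];  /* set by "hardware" on access */
static bool     in_dram[NPAGES];     /* false: page resides in NVM  */
static unsigned hot_count[NPAGES];

/* Called periodically: a page whose access bit was set in enough
 * sampling windows is treated as hot and migrated from NVM to DRAM;
 * counts decay so pages that cool down stop being candidates. */
void sample_and_migrate(void)
{
    for (size_t p = 0; p < NPAGES; p++) {
        if (access_bit[p]) {
            access_bit[p] = false;               /* read and clear   */
            if (!in_dram[p] && ++hot_count[p] >= HOT_THRESHOLD) {
                in_dram[p] = true;               /* migrate the page */
                hot_count[p] = 0;
            }
        } else if (hot_count[p] > 0) {
            hot_count[p]--;                      /* decay stale heat */
        }
    }
}
```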
17

de Paula Neto, Fernando M., Adenilton J. da Silva, Wilson R. de Oliveira, and Teresa B. Ludermir. "Quantum probabilistic associative memory architecture." Neurocomputing 351 (July 2019): 101–10. http://dx.doi.org/10.1016/j.neucom.2019.03.078.

18

Misev, Anastas, and Marjan Gusev. "Simulators for courses in advance computer architecture." Facta universitatis - series: Electronics and Energetics 18, no. 2 (2005): 237–52. http://dx.doi.org/10.2298/fuee0502237m.

Abstract:
The use of simulators in teaching computer architecture courses has proven to be the most effective approach, especially when the simulators offer rich graphical and visual representations of the architecture. In this paper we present several simulators used to teach instruction-level parallelism (ILP) courses. The simulators cover a wide range of concepts, such as internal logic organization, datapath, control, memory behavior, register renaming, branch prediction, and overall out-of-order execution. Special dedicated simulators cover details of internal organization, such as the Tomasulo approach and the scoreboard organization of reservation stations. This innovative approach to laboratory exercises is used for an advanced ILP course.
19

Kim, Hyunju, Mannhee Cho, Sanghyun Lee, Hyug Su Kwon, Woo Young Choi, and Youngmin Kim. "Content-Addressable Memory System Using a Nanoelectromechanical Memory Switch." Electronics 11, no. 3 (February 7, 2022): 481. http://dx.doi.org/10.3390/electronics11030481.

Abstract:
Content-addressable memory (CAM) performs a parallel search operation by comparing the search data with all content stored in memory during a single cycle, instead of finding the data using an address. Conventional CAM designs use a dynamic CMOS architecture for high matching speed and high density; however, such implementations require the use of system clocks, and thus, suffer from timing violations and design limitations, such as charge sharing. In this paper, we propose a static-based architecture for a low-power, high-speed binary CAM (BCAM) and ternary CAM (TCAM), using a nanoelectromechanical (NEM) memory switch for nonvolatile data storage. We designed the proposed CAM architectures on a 65 nm process node with a 1.2 V operating voltage. The results of the layout simulation show that the proposed design has up to 23% less propagation delay, three times less matching power, and 9.4 times less area than a conventional design.
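In software terms, a CAM answers "which entry equals this key" rather than "what is stored at this address"; the sequential C model below computes what a ternary CAM computes in a single parallel cycle (the 32-bit width and the names are assumptions):

```c
#include <stddef.h>
#include <stdint.h>

/* A ternary CAM entry: 'care' selects which bits must match; zero bits
 * in 'care' are don't-cares, which is what distinguishes a TCAM from a
 * binary CAM. */
struct tcam_entry {
    uint32_t value;
    uint32_t care;
};

/* Return the index of the first matching entry, or -1 if none match.
 * Hardware evaluates all n comparisons simultaneously; this loop is
 * the sequential equivalent of that one-cycle parallel search. */
int tcam_search(const struct tcam_entry *t, size_t n, uint32_t key)
{
    for (size_t i = 0; i < n; i++)
        if (((key ^ t[i].value) & t[i].care) == 0)
            return (int)i;
    return -1;
}
```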
20

Schumacher, Tobias, Tim Süß, Christian Plessl, and Marco Platzner. "FPGA Acceleration of Communication-Bound Streaming Applications: Architecture Modeling and a 3D Image Compositing Case Study." International Journal of Reconfigurable Computing 2011 (2011): 1–11. http://dx.doi.org/10.1155/2011/760954.

Abstract:
Reconfigurable computers usually provide a limited number of different memory resources, such as host memory, external memory, and on-chip memory, with different capacities and communication characteristics. A key challenge for achieving high performance with reconfigurable accelerators is the efficient utilization of the available memory resources. Detailed knowledge of the memories' parameters is key to generating an optimized communication layout. In this paper, we discuss a benchmarking environment for generating such a characterization. The environment is built on IMORC, our architectural template and on-chip network for creating reconfigurable accelerators. We provide a characterization of the memory resources available on the XtremeData XD1000 reconfigurable computer. Based on this data, we present as a case study the implementation of a 3D image compositing accelerator that is able to double the frame rate of a parallel renderer.
21

Matick, R. E. "Impact of memory systems on computer architecture and system organization." IBM Systems Journal 25, no. 3.4 (1986): 274–305. http://dx.doi.org/10.1147/sj.253.0274.

22

Mastani, S. Aruna, and S. Kannappan. "Distributed Memory based Architecture for Multiplier." International Journal of Computing and Digital Systems 12, no. 3 (August 6, 2022): 523–32. http://dx.doi.org/10.12785/ijcds/120142.

23

Wang, Yi, Mingxu Zhang, and Jing Yang. "Towards memory-efficient processing-in-memory architecture for convolutional neural networks." ACM SIGPLAN Notices 52, no. 5 (September 14, 2017): 81–90. http://dx.doi.org/10.1145/3140582.3081032.

24

Nair, R., S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C. Y. Cher, et al. "Active Memory Cube: A processing-in-memory architecture for exascale systems." IBM Journal of Research and Development 59, no. 2/3 (March 2015): 17:1–17:14. http://dx.doi.org/10.1147/jrd.2015.2409732.

25

Carmichael, Patrick. "The Internet, Information Architecture and Community Memory." Journal of Computer-Mediated Communication 8, no. 2 (June 23, 2006): 0. http://dx.doi.org/10.1111/j.1083-6101.2003.tb00208.x.

26

Krishnamoorthy, S., and A. Choudhary. "A Scalable Distributed Shared Memory Architecture." Journal of Parallel and Distributed Computing 22, no. 3 (September 1994): 547–54. http://dx.doi.org/10.1006/jpdc.1994.1110.

27

SAHNI, SARTAJ. "DATA MANIPULATION ON THE DISTRIBUTED MEMORY BUS COMPUTER." Parallel Processing Letters 05, no. 01 (March 1995): 3–14. http://dx.doi.org/10.1142/s0129626495000023.

Abstract:
We consider fundamental data manipulation operations such as broadcasting, prefix sum, data sum, data shift, data accumulation, consecutive sum, adjacent sum, sorting, and random access reads and writes, and show how these may be performed on the distributed memory bus computer (DMBC). In addition, we study two image processing applications: shrinking and expanding, and template matching. The DMBC algorithms are generally simpler than corresponding algorithms of the same time complexity developed for other reconfigurable bus computers.
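Several of the listed primitives reduce to scans; as a reference point, here is the sequential definition of the inclusive prefix sum that the DMBC computes with its reconfigurable buses (generic C, not the paper's bus algorithm):

```c
/* Inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i]. */
void prefix_sum(const int *in, int *out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += in[i];
        out[i] = acc;
    }
}
```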
28

Keyes, D. E., H. Ltaief, and G. Turkiyyah. "Hierarchical algorithms on hierarchical architectures." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378, no. 2166 (January 20, 2020): 20190055. http://dx.doi.org/10.1098/rsta.2019.0055.

Abstract:
A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high-computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architecture possesses hierarchies of its own that do not generally align with those of the algorithm. We describe modules of a software toolkit, hierarchical computations on manycore architectures, that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
29

Popov, Oleksandr, and Oleksiy Chystiakov. "On the Efficiency of Algorithms with Multi-level Parallelism." Physico-mathematical modelling and informational technologies, no. 33 (September 5, 2021): 133–37. http://dx.doi.org/10.15407/fmmit2021.33.133.

Abstract:
The paper investigates the efficiency of algorithms for solving computational mathematics problems that use a multilevel model of parallel computing on heterogeneous computer systems. A methodology for estimating the speedup of algorithms on computers that use a multilevel model of parallel computing is proposed. As an example, a parallel subspace iteration method for solving the generalized algebraic eigenvalue problem for symmetric positive definite sparse matrices is considered. For the presented algorithms, estimates of the speedup coefficients and efficiency were obtained on hybrid-architecture computers with graphics accelerators, on multi-core shared-memory computers, and on multi-node MIMD-architecture computers.
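For context, the speedup and efficiency coefficients such studies estimate are simple ratios; in generic C (textbook definitions, not the paper's specific estimates):

```c
/* Classic parallel metrics: speedup S(p) = T(1) / T(p) and
 * efficiency E(p) = S(p) / p, where T(1) is the serial run time and
 * T(p) the run time on p processing units. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

double efficiency(double t_serial, double t_parallel, int p)
{
    return speedup(t_serial, t_parallel) / (double)p;
}
```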
30

Park, Naebeom, Sungju Ryu, Jaeha Kung, and Jae-Joon Kim. "High-throughput Near-Memory Processing on CNNs with 3D HBM-like Memory." ACM Transactions on Design Automation of Electronic Systems 26, no. 6 (June 28, 2021): 1–20. http://dx.doi.org/10.1145/3460971.

Abstract:
This article discusses a high-performance near-memory neural network (NN) accelerator architecture utilizing the logic die in three-dimensional (3D) High Bandwidth Memory (HBM)-like memory. As most previously reported 3D-memory-based near-memory NN accelerator designs used the Hybrid Memory Cube (HMC), we first focus on identifying the key differences between HBM and HMC in terms of near-memory NN accelerator design. One of the major differences between the two 3D memories is that HBM has centralized through-silicon-via (TSV) channels while HMC has TSV channels distributed across separate vaults. Based on this observation, we introduce Round-Robin Data Fetching and Groupwise Broadcast schemes that exploit the centralized TSV channels to improve the data feeding rate for the processing elements. Using synthesized designs in a 28-nm CMOS technology, the performance and energy consumption of the proposed architectures with various dataflow models are evaluated. Experimental results show that the proposed schemes reduce runtime by 16.4–39.3% on average and energy consumption by 2.1–5.1% on average compared to conventional data fetching schemes.
31

Kim, Youngsik, Tack-Don Han, and Shin-Dug Kim. "Impact of the memory interface structure in the memory-processor integrated architecture for computer vision." Journal of Systems Architecture 46, no. 3 (January 2000): 259–74. http://dx.doi.org/10.1016/s1383-7621(99)00005-3.

32

Huh, Joonmoo, and Deokwoo Lee. "Effective On-Chip Communication for Message Passing Programs on Multi-Core Processors." Electronics 10, no. 21 (November 3, 2021): 2681. http://dx.doi.org/10.3390/electronics10212681.

Abstract:
Shared memory is the most popular parallel programming model for multi-core processors, while message passing is generally used for large distributed machines. However, as the number of cores on a chip increases, the relative merits of shared memory versus message passing change, and we argue that message passing becomes a viable, high-performing parallel programming model. To test this hypothesis, we compare a shared memory architecture with a new message passing architecture on a suite of applications tuned for each system independently. Perhaps surprisingly, the fundamental behaviors of the applications studied in this work, when optimized for both models, are very similar, and both could execute efficiently on multicore architectures despite many implementation differences. Furthermore, if the hardware is tuned to support message passing, with bulk message transfer, the elimination of unnecessary coherence overheads, and effective support for global operations, then some applications perform much better on a message passing architecture. Leveraging these insights, we design a message passing architecture that supports both memory-to-memory and cache-to-cache messaging in hardware. With the new architecture, message passing outperforms its shared memory counterpart on many of the applications due to the unique advantages of the message passing hardware over cache coherence. In the best case, message passing achieves up to a 34% increase in speed over its shared memory counterpart, and it achieves an average 10% increase in speed. In the worst case, message passing is slowed down in two applications, CG (conjugate gradient) and FT (Fourier transform), because it could not handle their data-sharing patterns as well as its shared memory counterpart. Overall, our analysis demonstrates the importance of considering message passing as a high-performing, hardware-supported programming model on future multicore architectures.
33

Inggs, Cornelia P., and Howard Barringer. "CTL* Model Checking on a Shared-Memory Architecture." Electronic Notes in Theoretical Computer Science 128, no. 3 (April 2005): 107–23. http://dx.doi.org/10.1016/j.entcs.2004.10.022.

34

Kammler, David, Ernst Martin Witte, Anupam Chattopadhyay, Bastian Bauwens, Gerd Ascheid, Rainer Leupers, and Heinrich Meyr. "Automatic Generation of Memory Interfaces for ASIPs." International Journal of Embedded and Real-Time Communication Systems 1, no. 3 (July 2010): 1–23. http://dx.doi.org/10.4018/jertcs.2010070101.

Abstract:
With the growing market for multi-processor system-on-chip (MPSoC) solutions, application-specific instruction-set processors (ASIPs) gain importance as they allow for a wide tradeoff between flexibility and efficiency in such a system. Their development is aided by architecture description languages (ADLs) supporting the automatic generation of architecture-specific tool sets as well as synthesizable register transfer level (RTL) implementations from a single architecture model. However, these generated implementations have to be manually adapted to the interfaces of dedicated memories or memory controllers, slowing down the design-space exploration regarding the memory architecture. To overcome this drawback, the authors extend RTL code generation from ADL models with the automatic generation of memory interfaces. This is accomplished by introducing a new abstract and versatile description format for memory interfaces and their timing protocols. The feasibility of this approach is demonstrated in real-life case studies, including a design space exploration for a banked memory system.
35

TIRUVEEDHULA, V., and J. S. BEDI. "Performance of hypercube architecture with shared memory." International Journal of Systems Science 25, no. 4 (April 1994): 695–705. http://dx.doi.org/10.1080/00207729408928990.

36

Liebendorfer, Adam. "Non-volatile memory devices offer alternative computer architecture for neural networks." Scilight 2020, no. 23 (June 5, 2020): 231104. http://dx.doi.org/10.1063/10.0001403.

37

Moreno, Lorenzo, Evelio J. González, Beatrice Popescu, Jonay Toledo, Jesús Torres, and Carina Gonzalez. "MNEME: A memory hierarchy simulator for an engineering computer architecture course." Computer Applications in Engineering Education 19, no. 2 (April 21, 2011): 358–64. http://dx.doi.org/10.1002/cae.20317.

38

Irakliotis, Leo J., Carl W. Wilmsen, and Pericles A. Mitkas. "The Optical Memory–Electric Computer Interface as a Parallel Processing Architecture." Journal of Parallel and Distributed Computing 41, no. 1 (February 1997): 67–77. http://dx.doi.org/10.1006/jpdc.1996.1286.

39

Rashid, Nafiul, Berken Utku Demirel, Mohanad Odema, and Mohammad Abdullah Al Faruque. "Template Matching Based Early Exit CNN for Energy-efficient Myocardial Infarction Detection on Low-power Wearable Devices." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, no. 2 (July 4, 2022): 1–22. http://dx.doi.org/10.1145/3534580.

Abstract:
Myocardial Infarction (MI), also known as heart attack, is a life-threatening form of heart disease that is a leading cause of death worldwide. Its recurrent and silent nature emphasizes the need for continuous monitoring through wearable devices. The wearable device solutions should provide adequate performance while being resource-constrained in terms of power and memory. This paper proposes an MI detection methodology using a Convolutional Neural Network (CNN) that outperforms the state-of-the-art works on wearable devices for two datasets - PTB and PTB-XL, while being energy and memory-efficient. Moreover, we also propose a novel Template Matching based Early Exit (TMEX) CNN architecture that further increases the energy efficiency compared to baseline architecture while maintaining similar performance. Our baseline and TMEX architecture achieve 99.33% and 99.24% accuracy on PTB dataset, whereas on PTB-XL dataset they achieve 84.36% and 84.24% accuracy, respectively. Both architectures are suitable for wearable devices requiring only 20 KB of RAM. Evaluation of real hardware shows that our baseline architecture is 0.6x to 53x more energy-efficient than the state-of-the-art works on wearable devices. Moreover, our TMEX architecture further improves the energy efficiency by 8.12% (PTB) and 6.36% (PTB-XL) while maintaining similar performance as the baseline architecture.
40

Bolosky, William J., Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, and Alan L. Cox. "NUMA policies and their relation to memory architecture." ACM SIGPLAN Notices 26, no. 4 (April 2, 1991): 212–21. http://dx.doi.org/10.1145/106973.106994.

41

Zhang, Weihua, Xinglong Qian, Ye Wang, Binyu Zang, and Chuanqi Zhu. "Optimizing compiler for shared-memory multiple SIMD architecture." ACM SIGPLAN Notices 41, no. 7 (July 12, 2006): 199–208. http://dx.doi.org/10.1145/1159974.1134679.

42

Sinha, Mitali, Gade Sri Harsha, Pramit Bhattacharyya, and Sujay Deb. "Design Space Optimization of Shared Memory Architecture in Accelerator-rich Systems." ACM Transactions on Design Automation of Electronic Systems 26, no. 4 (April 2021): 1–31. http://dx.doi.org/10.1145/3446001.

Abstract:
Shared memory architectures, as opposed to private-only memories, provide a viable alternative for meeting the ever-increasing memory requirements of multi-accelerator systems and achieving high performance under stringent area and energy constraints. However, impulsive memory sharing degrades performance due to network contention and the latency of accessing shared memory. We propose the Accelerator Shared Memory (ASM) framework to provide an optimal private/shared memory configuration and shared data allocation under a system's resource and network constraints. Evaluations show ASM provides up to 34.35% and 31.34% improvement in performance and energy, respectively, over baseline systems.
43

Steane, A. M. "Quantum computer architecture for fast entropy extraction." Quantum Information and Computation 2, no. 4 (June 2002): 297–306. http://dx.doi.org/10.26421/qic2.4-3.

Abstract:
If a quantum computer is stabilized by fault-tolerant quantum error correction (QEC), then most of its resources (qubits and operations) are dedicated to the extraction of error information. Analysis of this process leads to a set of central requirements for candidate computing devices, in addition to the basic ones of stable qubits and controllable gates and measurements. The logical structure of the extraction process has a natural geometry and hierarchy of communication needs; a computer whose physical architecture is designed to reflect this will be able to tolerate the most noise. The relevant networks are dominated by quantum information transport, therefore to assess a computing device it is necessary to characterize its ability to transport quantum information, in addition to assessing the performance of conditional logic on nearest neighbours and the passive stability of the memory. The transport distances involved in QEC networks are estimated, and it is found that a device relying on swap operations for information transport must have those operations an order of magnitude more precise than the controlled gates of a device which can transport information at low cost.
44

Alachiotis, Nikolaos, Panagiotis Skrimponis, Manolis Pissadakis, and Dionisios Pnevmatikatos. "Scalable Phylogeny Reconstruction with Disaggregated Near-memory Processing." ACM Transactions on Reconfigurable Technology and Systems 15, no. 3 (September 30, 2022): 1–32. http://dx.doi.org/10.1145/3484983.

Abstract:
Disaggregated computer architectures eliminate resource fragmentation in next-generation datacenters by enabling virtual machines to employ resources such as CPUs, memory, and accelerators that are physically located on different servers. While this paves the way for highly compute- and/or memory-intensive applications to potentially deploy all CPUs and/or memory resources in a datacenter, it poses a major challenge to the efficient deployment of hardware accelerators: input/output data can reside on different servers than the ones hosting accelerator resources, thereby requiring time- and energy-consuming remote data transfers that diminish the gains of hardware acceleration. Targeting a disaggregated datacenter architecture similar to the IBM dReDBox disaggregated datacenter prototype, the present work explores the potential of deploying custom acceleration units adjacently to the disaggregated-memory controller on memory bricks (in dReDBox terminology), which is implemented on FPGA technology, to reduce data movement and improve performance and energy efficiency when reconstructing large phylogenies (evolutionary relationships among organisms). A fundamental computational kernel is the Phylogenetic Likelihood Function (PLF), which dominates the total execution time (up to 95%) of widely used maximum-likelihood methods. Numerous efforts to boost PLF performance over the years focused on accelerating computation; since the PLF is a data-intensive, memory-bound operation, performance remains limited by data movement, and memory disaggregation only exacerbates the problem. We describe two near-memory processing models, one that addresses the problem of workload distribution to memory bricks, which is particularly tailored toward larger genomes (e.g., plants and mammals), and one that reduces overall memory requirements through memory-side data interpolation transparently to the application, thereby allowing the phylogeny size to scale to a larger number of organisms without requiring additional memory.
45

Wei, Rongshan, Chenjia Li, Chuandong Chen, Guangyu Sun, and Minghua He. "Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller." Electronics 10, no. 4 (February 10, 2021): 438. http://dx.doi.org/10.3390/electronics10040438.

Abstract:
Specialized accelerator architectures have achieved great success, and they are a trend in computer architecture development. However, as the memory access pattern of an accelerator is relatively complicated, memory access performance is relatively poor, limiting the overall performance improvement of hardware accelerators. Moreover, memory controllers for hardware accelerators have scarcely been researched. We consider a dedicated accelerator memory controller essential for improving memory access performance. To this end, we propose a dynamic random access memory (DRAM) controller called NNAMC for neural network accelerators, which monitors the memory access stream of an accelerator and steers it to the bank whose address mapping scheme best matches its access characteristics. NNAMC includes a stream access prediction unit (SAPU) that analyzes in hardware the type of data stream accessed by the accelerator, and it designs the address mappings for the different banks using a bank partitioning model (BPM). The image mapping method and hardware architecture were analyzed in a practical neural network accelerator. In the experiments, NNAMC achieved significantly lower access latency than competing address mapping schemes, increased the row buffer hit ratio by 13.68% on average (up to 26.17%), reduced system access latency by 26.3% on average (up to 37.68%), and lowered the hardware cost. In addition, we confirmed that NNAMC adapts efficiently to different network parameters.
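To make "address mapping scheme" concrete, here are two toy DRAM mappings in C; which address bits select the bank determines whether a streaming access pattern rotates across banks or hammers a single one (all field widths are assumptions, not NNAMC's actual schemes):

```c
#include <stdint.h>

struct dram_addr { uint32_t row, bank, col; };

/* Scheme A, row:bank:column. Consecutive addresses stay within one
 * row buffer; good for long unit-stride streams. */
struct dram_addr map_row_bank_col(uint32_t a)
{
    struct dram_addr d;
    d.col  = a & 0x3FF;           /* bits  9..0: 1024-byte row   */
    d.bank = (a >> 10) & 0x7;     /* bits 12..10: 8 banks        */
    d.row  = (a >> 13) & 0x3FFF;  /* bits 26..13: 16384 rows     */
    return d;
}

/* Scheme B, block-interleaved. Consecutive 64-byte blocks rotate over
 * the banks, so independent streams land in different banks. */
struct dram_addr map_block_interleaved(uint32_t a)
{
    struct dram_addr d;
    uint32_t lo = a & 0x3F;       /* bits  5..0: offset in block */
    d.bank = (a >> 6) & 0x7;      /* bits  8..6: bank from block */
    d.col  = (((a >> 9) & 0xF) << 6) | lo;  /* reassembled column */
    d.row  = (a >> 13) & 0x3FFF;
    return d;
}
```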
46

Giannoula, Christina, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. "Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures." ACM SIGMETRICS Performance Evaluation Review 50, no. 1 (June 20, 2022): 33–34. http://dx.doi.org/10.1145/3547353.3522661.

Abstract:
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth, and low memory access latency, making them a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the given input matrices. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture and presents SparseP, the first SpMV library for real PIM architectures. We make two key contributions. First, we design efficient SpMV algorithms to accelerate the SpMV kernel in current and future PIM systems, covering a wide variety of sparse matrices with diverse sparsity patterns. Second, we provide the first comprehensive analysis of SpMV on a real PIM architecture: we conduct a rigorous experimental analysis of SpMV kernels in the UPMEM PIM system, the first publicly available real-world PIM architecture. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects seeking to efficiently accelerate the SpMV kernel on real PIM systems. For our thorough characterization of SpMV PIM execution, results, insights, and the open-source SparseP software package [21], we refer the reader to the full version of the paper [3, 4]. The SparseP software package is publicly and freely available at https://github.com/CMU-SAFARI/SparseP.
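As a reminder of why SpMV is memory-bound, the standard kernel over the common compressed sparse row (CSR) format is a few lines of C (the abstract names only a "compressed matrix format"; CSR is used here as the typical instance):

```c
/* y = A * x with A stored in CSR form: rowptr[i]..rowptr[i+1] indexes
 * the nonzeros of row i. Each nonzero costs a value load, a column-
 * index load, and an irregular load from x; almost no arithmetic
 * hides that traffic, which is why PIM helps. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double acc = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            acc += val[j] * x[col[j]];
        y[i] = acc;
    }
}
```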
47

Kjelsø, M., M. Gooch, and S. Jones. "Performance evaluation of computer architectures with main memory data compression." Journal of Systems Architecture 45, no. 8 (February 1999): 571–90. http://dx.doi.org/10.1016/s1383-7621(98)00006-x.

48

Saponara, Sergio, and Luca Fanucci. "Homogeneous and Heterogeneous MPSoC Architectures with Network-On-Chip Connectivity for Low-Power and Real-Time Multimedia Signal Processing." VLSI Design 2012 (August 14, 2012): 1–17. http://dx.doi.org/10.1155/2012/450302.

Abstract:
Two multiprocessor system-on-chip (MPSoC) architectures are proposed and compared in the paper with reference to audio and video processing applications. One architecture exploits a homogeneous topology; it consists of 8 identical tiles, each made of a 32-bit RISC core enhanced by a 64-bit DSP coprocessor with local memory. The other MPSoC architecture exploits a heterogeneous-tile topology with on-chip distributed memory resources; the tiles act as application specific processors supporting a different class of algorithms. In both architectures, the multiple tiles are interconnected by a network-on-chip (NoC) infrastructure, through network interfaces and routers, which allows parallel operations of the multiple tiles. The functional performances and the implementation complexity of the NoC-based MPSoC architectures are assessed by synthesis results in submicron CMOS technology. Among the large set of supported algorithms, two case studies are considered: the real-time implementation of an H.264/MPEG AVC video codec and of a low-distortion digital audio amplifier. The heterogeneous architecture ensures a higher power efficiency and a smaller area occupation and is more suited for low-power multimedia processing, such as in mobile devices. The homogeneous scheme allows for a higher flexibility and easier system scalability and is more suited for general-purpose DSP tasks in power-supplied devices.
49

Idris, F., and S. Panchanathan. "Associative memory architecture for video compression." IEE Proceedings - Computers and Digital Techniques 142, no. 1 (1995): 55. http://dx.doi.org/10.1049/ip-cdt:19951615.

50

Woolbright, David, Vladimir Zanev, and Neal Rogers. "VisibleZ: A Mainframe Architecture Emulator for Computing Education." Serdica Journal of Computing 8, no. 4 (October 2, 2015): 389–408. http://dx.doi.org/10.55630/sjc.2014.8.389-408.

Abstract:
This paper describes a PC-based mainframe computer emulator called VisibleZ and its use in teaching mainframe Computer Organization and Assembly Programming classes. VisibleZ models IBM's z/Architecture and allows direct interpretation of mainframe assembly language object code in a graphical user interface environment that was developed in Java. The VisibleZ emulator acts as an interactive visualization tool to simulate enterprise computer architecture. The provided architectural components include main storage, CPU, registers, Program Status Word (PSW), and I/O Channels. Particular attention is given to providing visual clues to the user by color-coding screen components, machine instruction execution, and animation of the machine architecture components. Students interact with VisibleZ by executing machine instructions in a step-by-step mode, simultaneously observing the contents of memory, registers, and changes in the PSW during the fetch-decode-execute machine instruction cycle. The object-oriented design and implementation of VisibleZ allows students to develop their own instruction semantics by coding Java for existing specific z/Architecture machine instructions or to design and implement new machine instructions. The use of VisibleZ in lectures, labs, and assignments is described in the paper and supported by a website that hosts an extensive collection of related materials. VisibleZ has proven a useful tool in mainframe Assembly Language Programming and Computer Organization classes. Using VisibleZ, students develop a better understanding of mainframe concepts and components and of how the mainframe computer works. ACM Computing Classification System (1998): C.0, K.3.2.