
Journal articles on the topic 'Architecture manycore'


Consult the top 50 journal articles for your research on the topic 'Architecture manycore.'


1

Muddukrishna, Ananya, Peter A. Jonsson, and Mats Brorsson. "Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors." Scientific Programming 2015 (2015): 1–16. http://dx.doi.org/10.1155/2015/981759.

Abstract:
Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load-balancing at the expense of locality and ignore NUMA node/manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer of reasoning about NUMA system/manycore processor architecture details by delegating data distribution to the runtime system, and it uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and we show that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies while providing an architecture-oblivious approach for programmers.
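The task data dependence information mentioned in this abstract can be expressed directly in OpenMP source. Below is a minimal sketch (not the authors' runtime; the block count, sizes, and kernels are illustrative) of how per-block depend clauses expose producer/consumer relations that a locality-aware runtime could combine with first-touch page placement to keep tasks near their data:

```c
#include <stdlib.h>

#define N      (1 << 20)
#define CHUNKS 8

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    int chunk = N / CHUNKS;

    #pragma omp parallel
    #pragma omp single
    for (int c = 0; c < CHUNKS; c++) {
        double *ablk = a + c * chunk, *bblk = b + c * chunk;

        /* Producer task: first touch places this block's pages. */
        #pragma omp task depend(out: ablk[0:chunk])
        for (int i = 0; i < chunk; i++) ablk[i] = i;

        /* Consumer task: the depend clause names the same block, so a
         * locality-aware scheduler can run it near the producer's data. */
        #pragma omp task depend(in: ablk[0:chunk]) depend(out: bblk[0:chunk])
        for (int i = 0; i < chunk; i++) bblk[i] = 2.0 * ablk[i];
    }
    free(a); free(b);
    return 0;
}
```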
2

Choudhury, Dwaipayan, Aravind Sukumaran Rajam, Ananth Kalyanaraman, and Partha Pratim Pande. "High-Performance and Energy-Efficient 3D Manycore GPU Architecture for Accelerating Graph Analytics." ACM Journal on Emerging Technologies in Computing Systems 18, no. 1 (January 31, 2022): 1–19. http://dx.doi.org/10.1145/3482880.

Abstract:
Recent advances in GPU-based manycore accelerators provide the opportunity to efficiently process large-scale graphs on chip. However, real-world graphs have a diverse range of topology and connectivity patterns (e.g., degree distributions) that make the design of input-agnostic hardware architectures a challenge. Network-on-Chip (NoC)-based architectures provide a way to overcome this challenge, as the architectural topology can be used to approximately model the expected traffic patterns that emerge from graph application workloads. In this paper, we first study the mix of long- and short-range traffic patterns generated on-chip using graph workloads, and subsequently use the findings to adapt the design of an optimal NoC-based architecture. In particular, by leveraging emerging three-dimensional (3D) integration technology, we propose the design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follows a power-law distribution. The proposed 3D manycore GPU architecture outperforms its traditional planar (2D) counterparts in both performance and energy consumption. Moreover, by adopting a joint performance-thermal optimization strategy, we address the thermal concerns in a 3D design without noticeably compromising the achievable performance. The 3D integration technology is also leveraged to incorporate Near Data Processing (NDP) to complement the performance benefits introduced by the SWNoC architecture. As graph applications are inherently memory intensive, off-chip data movement gives rise to latency and energy overheads in the presence of external DRAM. In conventional GPU architectures, as the main memory layer is not integrated with the logic, off-chip data movement negatively impacts overall performance and energy consumption. We demonstrate that NDP significantly reduces the overheads associated with such frequent and irregular memory accesses in graph-based applications. The proposed SWNoC-enabled NDP framework, which integrates 3D memory (like Micron's HMC) with a massive number of GPU cores, achieves a 29.5% performance improvement and 30.03% less energy consumption on average compared to a conventional planar mesh-based design with external DRAM.
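A toy sketch of the power-law link-placement idea (illustrative only: the 4x4 grid, exponent, and RNG are made-up parameters, not the paper's methodology): each candidate router pair receives a link with probability decaying as a power law of Manhattan distance, so short links dominate while occasional long-range shortcuts give small-world connectivity.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define DIM   4     /* 4x4 router grid (illustrative size) */
#define ALPHA 2.0   /* power-law exponent (illustrative value) */

/* Link probability decays as d^(-ALPHA) with Manhattan distance d. */
static double link_prob(int x1, int y1, int x2, int y2) {
    int d = abs(x1 - x2) + abs(y1 - y2);
    return d ? pow((double)d, -ALPHA) : 0.0;   /* no self-links */
}

int main(void) {
    srand(42);
    for (int i = 0; i < DIM * DIM; i++)
        for (int j = i + 1; j < DIM * DIM; j++) {
            double p = link_prob(i % DIM, i / DIM, j % DIM, j / DIM);
            if ((double)rand() / RAND_MAX < p)
                printf("link %2d <-> %2d  (p = %.3f)\n", i, j, p);
        }
    return 0;
}
```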
3

Korolija, Nenad, and Kent Milfeld. "Towards hybrid supercomputing architectures." Journal of Computer and Forensic Sciences 1, no. 1 (2022): 47–54. http://dx.doi.org/10.5937/1-42710.

Abstract:
In light of recent work on combining control-flow and dataflow architectures on the same chip die, a new architecture based on an asymmetric multicore processor is proposed. Control-flow architectures are described as the most commonly used computer architecture today. Both multicore and manycore architectures are explained, as they are based on the same principles. A dataflow computing model assumes that input data flows through the hardware; it can be realized as either a software or a hardware dataflow implementation. In software dataflow, processors based on the control-flow paradigm take tasks from a shared queue as they become available (if there are any). In hardware dataflow architectures, the hardware is configured for a particular algorithm, input data is streamed into the hardware, and the output is streamed back to the multicore processor for further processing. Hardware dataflow architectures are usually implemented with FPGAs. Hybrid architectures employ asymmetric multicore and manycore computer architectures that are based on the control-flow and hardware dataflow architectures, all combined on the same chip die. Advantages include faster processing, lower power consumption (and heat), and less space needed for the hardware.
4

Arka, Aqeeb Iqbal, Biresh Kumar Joardar, Ryan Gary Kim, Dae Hyun Kim, Janardhan Rao Doppa, and Partha Pratim Pande. "HeM3D." ACM Transactions on Design Automation of Electronic Systems 26, no. 2 (February 2021): 1–21. http://dx.doi.org/10.1145/3424239.

Abstract:
Heterogeneous manycore architectures are the key to efficiently executing compute- and data-intensive applications. Through-silicon-via (TSV)-based 3D manycore systems are a promising solution in this direction, as they enable the integration of disparate computing cores on a single system. Recent industry trends show the viability of 3D integration in real products (e.g., the Intel Lakefield SoC, the AMD Radeon R9 Fury X graphics card, and the Xilinx Virtex-7 2000T/H580T). However, the achievable performance of conventional TSV-based 3D systems is ultimately bottlenecked by the horizontal wires (wires in each planar die). Moreover, current TSV 3D architectures suffer from thermal limitations. Hence, TSV-based architectures do not realize the full potential of 3D integration. Monolithic 3D (M3D) integration, a breakthrough technology to achieve “More Moore and More Than Moore,” opens up the possibility of designing cores and associated network routers using multiple layers by utilizing monolithic inter-tier vias (MIVs) and hence reducing the effective wire length. Compared to TSV-based 3D integrated circuits (ICs), M3D offers the “true” benefits of the vertical dimension for system integration: the size of an MIV used in M3D is over 100× smaller than a TSV. This dramatic reduction in via size and the resulting increase in density open up numerous opportunities for design optimization in 3D manycore systems: designers can use up to millions of MIVs for ultra-fine-grained 3D optimization, where individual cores and routers can be spread across multiple tiers for extreme power and performance optimization. In this work, we demonstrate how M3D-enabled vertical core and uncore elements offer significant performance and thermal improvements in manycore heterogeneous architectures compared to their TSV-based counterparts. To overcome the difficult optimization challenges due to the large design space and complex interactions among the heterogeneous components (CPU, GPU, last-level cache, etc.) in an M3D-based manycore chip, we leverage novel design-space exploration algorithms to trade off different objectives. The proposed M3D-enabled heterogeneous architecture, called HeM3D, outperforms its state-of-the-art TSV-equivalent counterpart by up to 18.3% in execution time while being up to 19°C cooler.
5

Lahdhiri, Habiba, Jordane Lorandel, Salvatore Monteleone, Emmanuelle Bourdel, and Maurizio Palesi. "Framework for Design Exploration and Performance Analysis of RF-NoC Manycore Architecture." Journal of Low Power Electronics and Applications 10, no. 4 (November 3, 2020): 37. http://dx.doi.org/10.3390/jlpea10040037.

Abstract:
The Network-on-Chip (NoC) paradigm has been proposed as a promising solution to enable the handling of a high degree of integration in multi-/many-core architectures. Despite their advantages, wired NoC infrastructures face several performance issues regarding multi-hop long-distance communications. RF-NoC is an attractive solution offering high performance and multicast/broadcast capabilities. However, managing RF links is a critical aspect that relies on both application-dependent and architectural parameters. This paper proposes a design space exploration framework for an OFDMA-based RF-NoC architecture, which takes advantage of both real application benchmarks simulated using Sniper and an RF-NoC architecture modeled using Noxim. We adopted the proposed framework to finely configure a routing algorithm working with real traffic, achieving up to a 45% delay reduction compared to a wired NoC setup in similar conditions.
6

Li, Hongliang, Fang Zheng, Ziyu Hao, Hongguang Gao, Feng Guo, Yong Tang, Hui Lv, Xin Liu, and Fangyuan Chen. "Research on homegrown manycore architecture for intelligent computing." SCIENTIA SINICA Informationis 49, no. 3 (March 1, 2019): 247–55. http://dx.doi.org/10.1360/n112018-00283.

7

Dévigne, Clément, Jean-Baptiste Bréjon, Quentin L. Meunier, and Franck Wajsbürt. "Executing secured virtual machines within a manycore architecture." Microprocessors and Microsystems 48 (February 2017): 21–35. http://dx.doi.org/10.1016/j.micpro.2016.09.008.

8

Li, Mingzhen, Yi Liu, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. "Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture." IEEE Transactions on Parallel and Distributed Systems 31, no. 7 (July 1, 2020): 1636–50. http://dx.doi.org/10.1109/tpds.2019.2953852.

9

Hosseini, Morteza, and Tinoosh Mohsenin. "Binary Precision Neural Network Manycore Accelerator." ACM Journal on Emerging Technologies in Computing Systems 17, no. 2 (April 2021): 1–27. http://dx.doi.org/10.1145/3423136.

Abstract:
This article presents a low-power, programmable, domain-specific manycore accelerator, the Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary precision weight/activation neural network models. Such networks have compact models in which weights are constrained to only 1 bit and can be packed several to one memory entry, minimizing memory footprint to its finest. Packing weights also facilitates executing single instruction, multiple data with simple circuitry, which allows maximizing performance and efficiency. The proposed BiNMAC has lightweight cores that support domain-specific instructions and a router-based memory access architecture that helps with efficient implementation of layers in binary precision weight/activation neural networks of proper size. With only 3.73% and 1.98% area and average power overhead, respectively, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the instruction set architecture of the BiNMAC, each of which replaces execution cycles of frequently used functions with 1 clock cycle that otherwise would have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory on a bit-level basis, which expedites reshaping intermediate data to be well-aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm² with an average power of 232 mW at 1-GHz clock frequency and 1.1 V. The 64-cluster architecture takes 36.5 mm² of area and, if fully exploited, consumes a total power of 16.4 W and can perform 1,360 Giga Operations Per Second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies including ResNet-20 and LeNet-5 for high-performance image classification, as well as a ConvNet and a multilayer perceptron for low-power physiological applications, were implemented on BiNMAC. The implementation results indicate that the population-count instruction alone can expedite performance by approximately 5×. When the other new instructions are added to a RISC machine with an existing population-count instruction, performance increases by 58% on average. To compare the performance of the BiNMAC with other commercial off-the-shelf platforms, the case studies with their double-precision floating-point models were also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ~2.1%–9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. In low-power settings and within a margin of ~3.7%–5.5% accuracy loss compared to an ARM Cortex-A57 CPU implementation, BiNMAC is roughly 9.7×–17.2× (or 38.8×–68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
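To make the fused Population-Count-XNOR idea concrete: with weights and activations constrained to {-1, +1} and packed one per bit, a 64-element dot product reduces to an XNOR followed by a popcount. The sketch below is plain C (popcount via the GCC/Clang builtin), not BiNMAC's ISA, where the pair executes as a single instruction.

```c
#include <stdint.h>

/* Binary dot product over 64 weight/activation pairs encoded one bit
 * each (bit = 1 for +1, bit = 0 for -1). */
static int binary_dot64(uint64_t w, uint64_t x) {
    uint64_t agree = ~(w ^ x);                /* XNOR: 1 where signs match */
    int matches = __builtin_popcountll(agree);
    return 2 * matches - 64;                  /* matches - mismatches */
}
```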
10

Silva, Bruno A. da, Arthur M. Lima, Janier Arias-Garcia, Michael Huebner, and Jones Yudi. "A Manycore Vision Processor for Real-Time Smart Cameras." Sensors 21, no. 21 (October 27, 2021): 7137. http://dx.doi.org/10.3390/s21217137.

Abstract:
Real-time image processing and computer vision systems are now in the mainstream of technologies enabling applications for cyber-physical systems, the Internet of Things, augmented reality, and Industry 4.0. These applications bring the need for Smart Cameras for local real-time processing of images and videos. However, the massive amount of data to be processed within short deadlines cannot be handled by most commercial cameras. In this work, we show the design and implementation of a manycore vision processor architecture to be used in Smart Cameras. Exploiting massive parallelism and application-specific characteristics, our architecture is composed of distributed processing elements and memories connected through a Network-on-Chip. The architecture was implemented as an FPGA overlay, focusing on optimized hardware utilization. The parameterized architecture was characterized by its hardware occupation, maximum operating frequency, and processing frame rate. Different configurations ranging from one to eighty-one processing elements were implemented and compared to several works from the literature. Using a System-on-Chip composed of an FPGA integrated with a general-purpose processor, we showcase the flexibility and efficiency of the hardware/software architecture. The results show that the proposed architecture successfully combines programmability and performance, making it a suitable alternative for future Smart Cameras.
11

Karcher, Thomas, Christoph Schaefer, and Victor Pankratius. "Auto-tuning support for manycore applications." ACM SIGOPS Operating Systems Review 43, no. 2 (April 21, 2009): 96–97. http://dx.doi.org/10.1145/1531793.1531808.

12

Lin, Chit-Kwan, Andreas Wild, Gautham N. Chinya, Tsung-Han Lin, Mike Davies, and Hong Wang. "Mapping spiking neural networks onto a manycore neuromorphic architecture." ACM SIGPLAN Notices 53, no. 4 (December 2, 2018): 78–89. http://dx.doi.org/10.1145/3296979.3192371.

13

Li, Sheng, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. "The McPAT Framework for Multicore and Manycore Architectures." ACM Transactions on Architecture and Code Optimization 10, no. 1 (April 2013): 1–29. http://dx.doi.org/10.1145/2445572.2445577.

14

Park, Hana, Young-Woong Ko, Jungmin So, and Jeong-Gun Lee. "Synthesizable Manycore Processor Designs with FPGA in Teaching Computer Architecture." International Journal of Control and Automation 6, no. 5 (October 31, 2013): 429–38. http://dx.doi.org/10.14257/ijca.2013.6.5.38.

15

Sharma, Harsh, Lukas Pfromm, Rasit Onur Topaloglu, Janardhan Rao Doppa, Umit Y. Ogras, Ananth Kalyanaraman, and Partha Pratim Pande. "Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks." ACM Transactions on Embedded Computing Systems 22, no. 5s (September 9, 2023): 1–21. http://dx.doi.org/10.1145/3608098.

Abstract:
Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications, including machine learning. Network-on-Interposer (NoI) enables the integration of multiple chiplets on a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures have limited computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data flow pattern, and optimizes the inter-chiplet data exchange to extract high performance for multiple types of convolutional neural network (CNN) inference tasks running concurrently. We demonstrate that the Floret architecture reduces latency and energy by up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple CNN tasks simultaneously. Floret achieves high performance and significant energy savings with much lower fabrication cost by exploiting the data-flow awareness of the CNN inference tasks.
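Space-filling curves linearize a 2D layout so that elements adjacent along the curve stay physically close. As an illustration, the routine below is the classic Hilbert-curve index-to-coordinate conversion; the exact curve and task-to-chiplet mapping Floret uses may differ.

```c
/* Convert position d along a Hilbert space-filling curve into (x, y)
 * coordinates on an n-by-n grid, n a power of two. Chiplets placed in
 * curve order remain physical neighbors when adjacent on the curve. */
static void hilbert_d2xy(int n, int d, int *x, int *y) {
    int rx, ry, t = d;
    *x = *y = 0;
    for (int s = 1; s < n; s *= 2) {
        rx = 1 & (t / 2);
        ry = 1 & (t ^ rx);
        if (ry == 0) {                        /* rotate the quadrant */
            if (rx == 1) { *x = s - 1 - *x; *y = s - 1 - *y; }
            int tmp = *x; *x = *y; *y = tmp;
        }
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}
```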
16

Serpa, Matheus S., Eduardo H. M. Cruz, Matthias Diener, Arthur M. Krause, Philippe O. A. Navaux, Jairo Panetta, Albert Farrés, Claudia Rosas, and Mauricio Hanzich. "Optimization strategies for geophysics models on manycore systems." International Journal of High Performance Computing Applications 33, no. 3 (January 17, 2019): 473–86. http://dx.doi.org/10.1177/1094342018824150.

Abstract:
Many software mechanisms for geophysics exploration in the oil and gas industries are based on wave propagation simulation. To perform such simulations, state-of-the-art high-performance computing architectures are employed, generating results faster and with more accuracy at each generation. The software must evolve to support the new features of each design to keep performance scaling. Furthermore, it is important to understand the impact of each change applied to the software in order to improve performance as much as possible. In this article, we propose several optimization strategies for a wave propagation model for six architectures: Intel Broadwell, Intel Haswell, Intel Knights Landing, Intel Knights Corner, NVIDIA Pascal, and NVIDIA Kepler. We focus on improving cache memory usage, vectorization, load balancing, portability, and locality in the memory hierarchy. We analyze the hardware impact of the optimizations, providing insights into how each strategy can improve the performance. The results show that NVIDIA Pascal outperforms the other considered architectures by up to 8.5×.
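Two of the strategy families named above, cache blocking and vectorization, fit in a few lines. The kernel below is a deliberately simplified 1D 3-point stencil; the article's wave-propagation kernels are higher-order and 3D, and the block size and coefficients here are illustrative.

```c
#define BLK 4096   /* block length tuned to cache size (illustrative) */

/* Cache blocking on the outer loop, vectorization on the inner loop. */
void stencil_step(const float *restrict in, float *restrict out, int n) {
    #pragma omp parallel for schedule(static)
    for (int bs = 1; bs < n - 1; bs += BLK) {
        int be = bs + BLK < n - 1 ? bs + BLK : n - 1;
        #pragma omp simd
        for (int i = bs; i < be; i++)
            out[i] = 0.5f * in[i] + 0.25f * (in[i - 1] + in[i + 1]);
    }
}
```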
17

Madroñal, D., R. Lazcano, R. Salvador, H. Fabelo, S. Ortega, G. M. Callico, E. Juarez, and C. Sanz. "SVM-based real-time hyperspectral image classifier on a manycore architecture." Journal of Systems Architecture 80 (October 2017): 30–40. http://dx.doi.org/10.1016/j.sysarc.2017.08.002.

18

Sharafeddin, Mageda, and Haitham Akkary. "A small and power efficient checkpoint core architecture for manycore processors." International Journal of High Performance Systems Architecture 5, no. 4 (2015): 216. http://dx.doi.org/10.1504/ijhpsa.2015.072852.

19

Liu, Feiyang, Haibo Zhang, Yawen Chen, Zhiyi Huang, and Huaxi Gu. "Wavelength-Reused Hierarchical Optical Network on Chip Architecture for Manycore Processors." IEEE Transactions on Sustainable Computing 4, no. 2 (April 1, 2019): 231–44. http://dx.doi.org/10.1109/tsusc.2017.2733551.

20

Halappanavar, Mahantesh, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. "Approximate weighted matching on emerging manycore and multithreaded architectures." International Journal of High Performance Computing Applications 26, no. 4 (August 9, 2012): 413–30. http://dx.doi.org/10.1177/1094342012452893.

Abstract:
Graph matching is a prototypical combinatorial problem with many applications in high-performance scientific computing. Optimal algorithms for computing matchings are challenging to parallelize. Approximation algorithms are amenable to parallelization and are therefore important to compute matchings for large-scale problems. Approximation algorithms also generate nearly optimal solutions that are sufficient for many applications. In this paper we present multithreaded algorithms for computing half-approximate weighted matching on state-of-the-art multicore (Intel Nehalem and AMD Magny-Cours), manycore (Nvidia Tesla and Nvidia Fermi), and massively multithreaded (Cray XMT) platforms. We provide two implementations: the first uses shared work queues and is suited for all platforms; and the second implementation, based on dataflow principles, exploits special features available on the Cray XMT. Using a carefully chosen dataset that exhibits characteristics from a wide range of applications, we show scalable performance across different platforms. In particular, for one instance of the input, an R-MAT graph (RMAT-G), we show speedups of about [Formula: see text] on [Formula: see text] cores of an AMD Magny-Cours, [Formula: see text] on [Formula: see text] cores of Intel Nehalem, [Formula: see text] on Nvidia Tesla and [Formula: see text] on Nvidia Fermi relative to one core of Intel Nehalem, and [Formula: see text] on [Formula: see text] processors of Cray XMT. We demonstrate strong as well as weak scaling for graphs with up to a billion edges using up to 12,800 threads. We avoid excessive fine-tuning for each platform and retain the basic structure of the algorithm uniformly across platforms. An exception is the dataflow algorithm designed specifically for the Cray XMT. To the best of the authors' knowledge, this is the first such large-scale study of the half-approximate weighted matching problem on multithreaded platforms. Driven by the critical enabling role of combinatorial algorithms such as matching in scientific computing and the emergence of informatics applications, there is a growing demand to support irregular computations on current and future computing platforms. In this context, we evaluate the capability of emerging multithreaded platforms to tolerate latency induced by irregular memory access patterns, and to support fine-grained parallelism via light-weight synchronization mechanisms. By contrasting the architectural features of these platforms against the Cray XMT, which is specifically designed to support irregular memory-intensive applications, we delineate the impact of these choices on performance.
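The half-approximate matching parallelized here rests on a simple invariant: any edge that is the heaviest remaining edge at both of its endpoints (a "locally dominant" edge) can be matched immediately. Below is a serial, deliberately unoptimized C sketch of that invariant; the paper's contribution is running it in parallel with shared work queues and, on the Cray XMT, dataflow synchronization.

```c
/* Half-approximate weighted matching via locally dominant edges.
 * Graph is an edge list: m edges (u[e], v[e]) with weights w[e];
 * mate[] has size n and must be initialized to -1 by the caller.
 * O(m^2) per pass; for illustration only. */
void half_approx_matching(int n, int m, const int *u, const int *v,
                          const double *w, int *mate) {
    (void)n;   /* n documents mate[]'s length */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int e = 0; e < m; e++) {
            if (mate[u[e]] != -1 || mate[v[e]] != -1) continue;
            int dominant = 1;  /* heaviest free edge at both endpoints? */
            for (int f = 0; f < m && dominant; f++) {
                if (f == e || mate[u[f]] != -1 || mate[v[f]] != -1) continue;
                int shares = u[f] == u[e] || u[f] == v[e] ||
                             v[f] == u[e] || v[f] == v[e];
                if (shares && w[f] > w[e]) dominant = 0;
            }
            if (dominant) {    /* match the locally dominant edge */
                mate[u[e]] = v[e];
                mate[v[e]] = u[e];
                changed = 1;
            }
        }
    }
}
```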
21

Kim, Dongki, Sungjoo Yoo, and Sunggu Lee. "A network congestion-aware memory subsystem for manycore." ACM Transactions on Embedded Computing Systems 12, no. 4 (June 2013): 1–18. http://dx.doi.org/10.1145/2485984.2485998.

22

Haggui, Olfa, Claude Tadonki, Lionel Lacassagne, Fatma Sayadi, and Bouraoui Ouni. "Harris corner detection on a NUMA manycore." Future Generation Computer Systems 88 (November 2018): 442–52. http://dx.doi.org/10.1016/j.future.2018.01.048.

23

Li, Wenzhe, Bingli Guo, Xin Li, Yu Zhou, Shanguo Huang, and George N. Rouskas. "A large-scale nesting ring multi-chip architecture for manycore processor systems." Optical Switching and Networking 31 (January 2019): 183–92. http://dx.doi.org/10.1016/j.osn.2018.10.004.

24

Sepúlveda, Johanna, Vania Marangozova-Martin, and Jeronimo Castrillon. "Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface." Procedia Computer Science 108 (2017): 1071–72. http://dx.doi.org/10.1016/j.procs.2017.05.276.

25

Martínez, Héctor, Sergio Barrachina, Maribel Castillo, Joaquín Tárraga, Ignacio Medina, Joaquín Dopazo, and Enrique S. Quintana-Ortí. "A framework for genomic sequencing on clusters of multicore and manycore processors." International Journal of High Performance Computing Applications 32, no. 3 (June 22, 2016): 393–406. http://dx.doi.org/10.1177/1094342016653243.

Abstract:
The advances in genomic sequencing during the past few years have motivated the development of fast and reliable software for DNA/RNA sequencing on current high performance architectures. Most of these efforts target multicore processors, only a few can also exploit graphics processing units, and a much smaller set will run in clusters equipped with any of these multi-threaded architecture technologies. Furthermore, the examples that can be used on clusters today are all strongly coupled with a particular aligner. In this paper we introduce an alignment framework that can be leveraged to coordinately run any “single-node” aligner, taking advantage of the resources of a cluster without having to modify any portion of the original software. The key to our transparent migration lies in hiding the complexity associated with the multi-node execution (such as coordinating the processes running in the cluster nodes) inside the generic-aligner framework. Moreover, following the design and operation in our Message Passing Interface (MPI) version of HPG Aligner RNA BWT, we organize the framework into two stages in order to be able to execute different aligners in each one of them. With this configuration, for example, the first stage can ideally apply a fast aligner to accelerate the process, while the second one can be tuned to act as a refinement stage that further improves the global alignment process with little cost.
26

Joardar, Biresh Kumar, Janardhan Rao Doppa, Hai Li, Krishnendu Chakrabarty, and Partha Pratim Pande. "Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators." ACM Transactions on Embedded Computing Systems 20, no. 5s (October 31, 2021): 1–23. http://dx.doi.org/10.1145/3476986.

Abstract:
The growing popularity of convolutional neural networks (CNNs) has led to the search for efficient computational platforms to accelerate CNN training. Resistive random-access memory (ReRAM)-based manycore architectures offer a promising alternative to commonly used GPU-based platforms for training CNNs. However, due to the immature fabrication process and limited write endurance, ReRAMs suffer from different types of faults. This makes training of CNNs challenging as weights are misrepresented when they are mapped to faulty ReRAM cells. This results in unstable training, leading to unacceptably low accuracy for the trained model. Due to the distributed nature of the mapping of the individual bits of a weight to different ReRAM cells, faulty weights often lead to exploding gradients. This in turn introduces a positive feedback in the training loop, resulting in extremely large and unstable weights. In this paper, we propose a lightweight and reliable CNN training methodology using weight clipping to prevent this phenomenon and enable training even in the presence of many faults. Weight clipping prevents large weights from destabilizing CNN training and provides the backpropagation algorithm with the opportunity to compensate for the weights mapped to faulty cells. The proposed methodology achieves near-GPU accuracy without introducing significant area or performance overheads. Experimental evaluation indicates that weight clipping enables the successful training of CNNs in the presence of faults, while also reducing training time by 4× on average compared to a conventional GPU platform. Moreover, we also demonstrate that weight clipping outperforms a recently proposed error correction code (ECC)-based method when training is carried out using faulty ReRAMs.
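The clipping mechanism itself is tiny; what matters is where it sits in the training loop. A hedged sketch follows: the plain-SGD update, learning rate, and threshold are placeholders, not the paper's exact training recipe.

```c
/* Clamp each weight into [-c, c] right after the optimizer update, so a
 * weight whose bits land in faulty ReRAM cells cannot grow without bound
 * and feed exploding gradients back into training. */
static inline float clip(float w, float c) {
    return w > c ? c : (w < -c ? -c : w);
}

void sgd_step_clipped(float *w, const float *grad, int n, float lr, float c) {
    for (int i = 0; i < n; i++)
        w[i] = clip(w[i] - lr * grad[i], c);
}
```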
27

Abdi, Daniel S., Francis X. Giraldo, Emil M. Constantinescu, Lester E. Carr, Lucas C. Wilcox, and Timothy C. Warburton. "Acceleration of the IMplicit–EXplicit nonhydrostatic unified model of the atmosphere on manycore processors." International Journal of High Performance Computing Applications 33, no. 2 (October 31, 2017): 242–67. http://dx.doi.org/10.1177/1094342017732395.

Abstract:
We present the acceleration of an IMplicit–EXplicit (IMEX) nonhydrostatic atmospheric model on manycore processors such as graphic processing units (GPUs) and Intel’s Many Integrated Core (MIC) architecture. IMEX time integration methods sidestep the constraint imposed by the Courant–Friedrichs–Lewy condition on explicit methods through corrective implicit solves within each time step. In this work, we implement and evaluate the performance of IMEX on manycore processors relative to explicit methods. Using 3D-IMEX at Courant number C = 15, we obtained a speedup of about 4× relative to an explicit time stepping method run with the maximum allowable C = 1. Moreover, the unconditional stability of IMEX with respect to the fast waves means the speedup can increase significantly with the Courant number as long as the accuracy of the resulting solution is acceptable. We show a speedup of 100× at C = 150 using 1D-IMEX to demonstrate this point. Several improvements on the IMEX procedure were necessary in order to outperform our results with explicit methods: (a) reducing the number of degrees of freedom of the IMEX formulation by forming the Schur complement, (b) formulating a horizontally explicit vertically implicit 1D-IMEX scheme that has a lower workload and better scalability than 3D-IMEX, (c) using high-order polynomial preconditioners to reduce the condition number of the resulting system, and (d) using a direct solver for the 1D-IMEX method by performing and storing LU factorizations once to obtain a constant cost for any Courant number. Without all of these improvements, explicit time integration methods turned out to be difficult to beat. We discuss in detail the IMEX infrastructure required for formulating and implementing efficient methods on manycore processors. Several parametric studies are conducted to demonstrate the gain from each of the abovementioned improvements. Finally, we validate our results with standard benchmark problems in numerical weather prediction and evaluate the performance and scalability of the IMEX method using up to 4192 GPUs and 16 Knights Landing processors.
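For readers unfamiliar with IMEX, here is a first-order sketch of the split; the article's schemes are higher-order variants with Schur-complement reduction and preconditioning. Writing the semi-discrete model as dq/dt = S(q) + F(q), with S the stiff fast-wave terms and F the rest:

```latex
% Stiff part implicit, nonstiff part explicit:
\[
  \frac{q^{n+1}-q^{n}}{\Delta t} = S\!\left(q^{n+1}\right) + F\!\left(q^{n}\right),
\]
% and if S is linear(ized), S(q) = \mathcal{S} q, each step reduces to
% the solve that the Schur complement, polynomial preconditioners, and
% stored LU factorizations described above all target:
\[
  \left(I - \Delta t\,\mathcal{S}\right) q^{n+1} = q^{n} + \Delta t\, F\!\left(q^{n}\right).
\]
```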
28

Ryu, Hoon, and Seungmin Lee. "Cost-efficient simulations of large-scale electronic structures in the standalone manycore architecture." Computer Physics Communications 267 (October 2021): 108078. http://dx.doi.org/10.1016/j.cpc.2021.108078.

29

Costa, Evaldo Bezerra, Gabriel Pereira Silva, and Marcello Goulart Teixeira. "An Approach to Parallel Algorithms for Long DNA Sequences Alignment on Manycore Architecture." Journal of Computational Biology 27, no. 8 (August 1, 2020): 1248–52. http://dx.doi.org/10.1089/cmb.2019.0362.

30

Lazcano, R., D. Madroñal, H. Fabelo, S. Ortega, R. Salvador, G. M. Callico, E. Juarez, and C. Sanz. "Adaptation of an Iterative PCA to a Manycore Architecture for Hyperspectral Image Processing." Journal of Signal Processing Systems 91, no. 7 (May 19, 2018): 759–71. http://dx.doi.org/10.1007/s11265-018-1380-9.

31

Salvana, Mary Lai O., Sameh Abdulah, Huang Huang, Hatem Ltaief, Ying Sun, Marc G. Genton, and David E. Keyes. "High Performance Multivariate Geospatial Statistics on Manycore Systems." IEEE Transactions on Parallel and Distributed Systems 32, no. 11 (November 1, 2021): 2719–33. http://dx.doi.org/10.1109/tpds.2021.3071423.

32

Finkbeiner, Jan, Thomas Gmeinder, Mark Pupilli, Alexander Titterton, and Emre Neftci. "Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 11996–2005. http://dx.doi.org/10.1609/aaai.v38i11.29087.

Abstract:
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel workloads and dense vector-matrix multiplications. Potentially more efficient neural network models utilizing sparsity and recurrence cannot leverage the full power of SIMD processors and are thus at a severe disadvantage compared to today's prominent parallel architectures like Transformers and CNNs, thereby hindering the path towards more sustainable AI. To overcome this limitation, we explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory. We implement a training routine based on backpropagation through time (BPTT) for the brain-inspired class of Spiking Neural Networks (SNNs) that feature binary sparse activations. We observe a massive advantage in using sparse activation tensors with a MIMD processor, the Intelligence Processing Unit (IPU), compared to GPUs. On training workloads, our results demonstrate 5–10× throughput gains compared to A100 GPUs and up to 38× gains for higher levels of activation sparsity, without a significant slowdown in training convergence or reduction in final model performance. Furthermore, our results show highly promising trends for both single- and multi-IPU configurations as we scale up to larger model sizes. Our work paves the way towards more efficient, non-standard models via AI training hardware beyond GPUs, and competitive large-scale SNN models.
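The binary sparse activations central to this argument come from spiking neuron dynamics. A minimal leaky integrate-and-fire step (illustrative constants, not the paper's IPU kernels) shows why: each neuron emits a 0/1 spike and most entries are zero, which MIMD cores holding their neurons in local memory can skip, whereas SIMD pipelines process the zeros anyway.

```c
#define NNEUR 256
#define VTH   1.0f   /* firing threshold (illustrative) */
#define DECAY 0.9f   /* membrane leak factor (illustrative) */

/* One LIF step: potentials integrate the input current and emit a
 * binary spike on threshold crossing, with a soft reset. */
void lif_step(float v[NNEUR], const float in[NNEUR],
              unsigned char spike[NNEUR]) {
    for (int i = 0; i < NNEUR; i++) {
        v[i] = DECAY * v[i] + in[i];
        spike[i] = (v[i] >= VTH);
        if (spike[i]) v[i] -= VTH;
    }
}
```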
33

Langer, Akhil, Ehsan Totoni, Udatta Palekar, and Laxmikant V. Kalé. "Energy-optimal configuration selection for manycore chips with variation." International Journal of High Performance Computing Applications 31, no. 5 (October 13, 2016): 451–66. http://dx.doi.org/10.1177/1094342016672082.

Abstract:
Operating chips at high energy efficiency is one of the major challenges for modern large-scale supercomputers. Low-voltage operation of transistors increases energy efficiency but leads to frequency and power variation across cores on the same chip. Finding energy-optimal configurations for such chips is a hard problem. In this work, we study how integer linear programming techniques can be used to obtain energy-efficient configurations of chips that have heterogeneous cores. Our proposed methodologies give optimal configurations, as compared with competent but sub-optimal heuristics, while having negligible timing overhead. The proposed ParSearch method gives up to 13.2% and 7% savings in energy, while causing only a 2% increase in execution time, for two HPC applications: miniMD and Jacobi, respectively. Our results show that integer linear programming can be a very powerful online method for obtaining energy-optimal configurations.
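To make the integer-linear-programming angle concrete, one plausible formulation of the selection problem is sketched below. It is a hedged reconstruction from the abstract, not the paper's exact model: binary variables choose at most one frequency level per core under a performance target.

```latex
% x_{cf} = 1 iff core c runs at level f, with measured power p_{cf}
% and throughput s_{cf} per (core, level) pair:
\begin{align*}
  \min_{x} \quad & \textstyle\sum_{c}\sum_{f} p_{cf}\, x_{cf}
      && \text{(chip power at the required work rate)}\\
  \text{s.t.} \quad & \textstyle\sum_{f} x_{cf} \le 1 \;\; \forall c
      && \text{(each core off, or at exactly one level)}\\
  & \textstyle\sum_{c}\sum_{f} s_{cf}\, x_{cf} \ge S_{\min}
      && \text{(meet the performance target)}\\
  & x_{cf} \in \{0,1\}.
\end{align*}
```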
34

Savas, Süleyman, Zain Ul-Abdin, and Tomas Nordström. "A framework to generate domain-specific manycore architectures from dataflow programs." Microprocessors and Microsystems 72 (February 2020): 102908. http://dx.doi.org/10.1016/j.micpro.2019.102908.

35

Koziolek, Heiko, Steffen Becker, Jens Happe, Petr Tuma, and Thijmen de Gooijer. "Towards software performance engineering for multicore and manycore systems." ACM SIGMETRICS Performance Evaluation Review 41, no. 3 (January 10, 2014): 2–11. http://dx.doi.org/10.1145/2567529.2567531.

36

Wu, Xiaowen, Jiang Xu, Yaoyao Ye, Xuan Wang, Mahdi Nikdast, Zhehui Wang, and Zhe Wang. "An Inter/Intra-Chip Optical Network for Manycore Processors." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 23, no. 4 (April 2015): 678–91. http://dx.doi.org/10.1109/tvlsi.2014.2319089.

37

Davies, Mike, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, et al. "Loihi: A Neuromorphic Manycore Processor with On-Chip Learning." IEEE Micro 38, no. 1 (January 2018): 82–99. http://dx.doi.org/10.1109/mm.2018.112130359.

38

Lyberis, Spyros, George Kalokerinos, Michalis Lygerakis, Vassilis Papaefstathiou, Iakovos Mavroidis, Manolis Katevenis, Dionisios Pnevmatikatos, and Dimitrios S. Nikolopoulos. "FPGA prototyping of emerging manycore architectures for parallel programming research using Formic boards." Journal of Systems Architecture 60, no. 6 (June 2014): 481–93. http://dx.doi.org/10.1016/j.sysarc.2014.03.002.

39

Pantoja, Maria, Maxence Weyrich, and Gerardo Fernández-Escribano. "Acceleration of MRI analysis using multicore and manycore paradigms." Journal of Supercomputing 76, no. 11 (January 18, 2020): 8679–90. http://dx.doi.org/10.1007/s11227-020-03154-9.

40

Flich, José, Giovanni Agosta, Philipp Ampletzer, David Atienza Alonso, Carlo Brandolese, Etienne Cappe, Alessandro Cilardo, et al. "Exploring manycore architectures for next-generation HPC systems through the MANGO approach." Microprocessors and Microsystems 61 (September 2018): 154–70. http://dx.doi.org/10.1016/j.micpro.2018.05.011.

41

Benchehida, Chawki, Mohammed Kamel Benhaoua, Houssam Eddine Zahaf, and Giuseppe Lipari. "Memory-processor co-scheduling for real-time tasks on network-on-chip manycore architectures." International Journal of High Performance Systems Architecture 11, no. 1 (2022): 1. http://dx.doi.org/10.1504/ijhpsa.2022.121877.

42

Lipari, Giuseppe, Chawki Benchehida, Houssam Eddine Zahaf, and Mohammed Kamel Benhaoua. "Memory-processor co-scheduling for real-time tasks on network-on-chip manycore architectures." International Journal of High Performance Systems Architecture 11, no. 1 (2022): 1. http://dx.doi.org/10.1504/ijhpsa.2022.10045987.

43

Keyes, D. E., H. Ltaief, and G. Turkiyyah. "Hierarchical algorithms on hierarchical architectures." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378, no. 2166 (January 20, 2020): 20190055. http://dx.doi.org/10.1098/rsta.2019.0055.

Abstract:
A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architecture possesses hierarchies of its own that do not generally align with those of the algorithm. We describe modules of a software toolkit, hierarchical computations on manycore architectures, that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
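As a worked sketch of the storage economics behind the tile low-rank "simpler cousins" mentioned above (tile size nb and rank k are generic symbols, not values from the article):

```latex
% Each off-diagonal nb-by-nb tile A_{ij} of a dense operator is
% compressed to rank k:
\[
  A_{ij} \approx U_{ij} V_{ij}^{\top}, \qquad
  U_{ij}, V_{ij} \in \mathbb{R}^{nb \times k}, \quad k \ll nb,
\]
% so per-tile storage drops from nb^2 to 2\,nb\,k words, while the
% arithmetic becomes GEMM-rich and thus well matched to nodes with
% high flop rates relative to memory capacity and bandwidth.
```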
44

Zhao, Han, Quan Chen, Yuxian Qiu, Ming Wu, Yao Shen, Jingwen Leng, Chao Li, and Minyi Guo. "Bandwidth and Locality Aware Task-stealing for Manycore Architectures with Bandwidth-Asymmetric Memory." ACM Transactions on Architecture and Code Optimization 15, no. 4 (January 8, 2019): 1–26. http://dx.doi.org/10.1145/3291058.

45

Tang, Xulong, Mahmut Taylan Kandemir, and Mustafa Karakoy. "Mix and Match: Reorganizing Tasks for Enhancing Data Locality." ACM SIGMETRICS Performance Evaluation Review 49, no. 1 (June 22, 2022): 47–48. http://dx.doi.org/10.1145/3543516.3460103.

Abstract:
Application programs that exhibit strong locality of reference lead to minimized cache misses and better performance in different architectures. In this paper, we target task-based programs, and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at a sub-task granularity. Second, based on the intensity of temporal and spatial data reuses among sub-tasks, we generate new tasks where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly-generated tasks to cores in an architecture-aware fashion with the knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data among them are executed in close proximity in time. The experiments show that, when targeting a state of the art manycore system, our compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average.
46

Pasricha, Sudeep, and Mahdi Nikdast. "A Survey of Silicon Photonics for Energy-Efficient Manycore Computing." IEEE Design & Test 37, no. 4 (August 2020): 60–81. http://dx.doi.org/10.1109/mdat.2020.2982628.

47

Li, Mengquan, Weichen Liu, Nan Guan, Yiyuan Xie, and Yaoyao Ye. "Hardware-Software Collaborative Thermal Sensing in Optical Network-on-Chip-based Manycore Systems." ACM Transactions on Embedded Computing Systems 18, no. 6 (January 22, 2020): 1–24. http://dx.doi.org/10.1145/3362099.

48

Tang, Xulong, Mahmut Taylan Kandemir, and Mustafa Karakoy. "Mix and Match: Reorganizing Tasks for Enhancing Data Locality." Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, no. 2 (June 2021): 1–24. http://dx.doi.org/10.1145/3460087.

Abstract:
Application programs that exhibit strong locality of reference lead to minimized cache misses and better performance in different architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in on-chip network should also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper, we target task-based programs (which cover a large set of available multithreaded programs), and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at a sub-task granularity. Second, based on the intensity of temporal and spatial data reuses among sub-tasks, we generate new tasks where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly-generated tasks to cores in an architecture-aware fashion with the knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data among them are executed in close proximity in time. The detailed experiments show that, when targeting a state of the art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it also outperforms two state-of-the-art data access optimizations for all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to maximum savings that could be achieved with perfect profiling information. Overall, our experimental results emphasize the importance of dividing an original set of tasks of an application into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
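Step two of the four-step approach, regrouping sub-tasks by reuse intensity, can be sketched as a greedy pass over a reuse graph. The code below is a toy stand-in for the paper's compiler partitioner; NSUB, K, and the seed-growing rule are invented for illustration.

```c
#include <string.h>

#define NSUB 8   /* number of sub-tasks (illustrative) */
#define K    4   /* sub-tasks packed into each new task (illustrative) */

/* Greedy regrouping: pick an unassigned seed sub-task, then pull in the
 * K-1 unassigned sub-tasks with the heaviest reuse edges to the seed.
 * reuse[i][j] counts data shared by sub-tasks i and j; task_of[i]
 * receives the new task id of sub-task i. */
void regroup(const int reuse[NSUB][NSUB], int task_of[NSUB]) {
    int assigned[NSUB];
    memset(assigned, 0, sizeof assigned);
    int t = 0;
    for (int seed = 0; seed < NSUB; seed++) {
        if (assigned[seed]) continue;
        task_of[seed] = t;
        assigned[seed] = 1;
        for (int picked = 1; picked < K; picked++) {
            int best = -1, best_w = -1;
            for (int j = 0; j < NSUB; j++)
                if (!assigned[j] && reuse[seed][j] > best_w) {
                    best = j;
                    best_w = reuse[seed][j];
                }
            if (best < 0) break;   /* no unassigned sub-tasks left */
            task_of[best] = t;
            assigned[best] = 1;
        }
        t++;
    }
}
```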
49

Karahoda, Sertaç, Osman Tufan Erenay, Kamer Kaya, Uraz Cengiz Türker, and Hüsnü Yenigün. "Multicore and manycore parallelization of cheap synchronizing sequence heuristics." Journal of Parallel and Distributed Computing 140 (June 2020): 13–24. http://dx.doi.org/10.1016/j.jpdc.2020.02.009.

50

Zhang, Lei, Yinhe Han, Qiang Xu, Xiaowei Li, and Huawei Li. "On Topology Reconfiguration for Defect-Tolerant NoC-Based Homogeneous Manycore Systems." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17, no. 9 (September 2009): 1173–86. http://dx.doi.org/10.1109/tvlsi.2008.2002108.
