Journal articles on the topic 'CPU-GPU Partitioning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 46 journal articles for your research on the topic 'CPU-GPU Partitioning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Benatia, Akrem, Weixing Ji, Yizhuo Wang, and Feng Shi. "Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms." International Journal of High Performance Computing Applications 34, no. 1 (2019): 66–80. http://dx.doi.org/10.1177/1094342019886628.

Full text
Abstract:
Sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most of the existing studies dedicated to improving this kernel have been targeting just one type of processing units, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging, CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned on different available processing units. The partitioning problem is more challenging with the existence of many s
APA, Harvard, Vancouver, ISO, and other styles
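As a quick illustration of the partitioning idea summarized in the abstract above, the sketch below splits the rows of a sparse matrix between a CPU and a GPU in proportion to assumed throughput figures. It is a minimal Python/SciPy sketch, not the authors' method; the GFLOP/s values, the simple row-wise split, and the fact that both halves actually run on the CPU here are assumptions made purely for demonstration.

# Minimal sketch: split SpMV rows between CPU and GPU in proportion to
# their assumed sustained throughputs (the GFLOP/s figures are made up).
import numpy as np
from scipy.sparse import random as sparse_random

cpu_gflops, gpu_gflops = 20.0, 180.0
A = sparse_random(10_000, 10_000, density=1e-3, format="csr")
x = np.random.rand(A.shape[1])

# Give the GPU a share of rows proportional to its relative throughput.
split = int(A.shape[0] * gpu_gflops / (cpu_gflops + gpu_gflops))
A_gpu, A_cpu = A[:split], A[split:]

# Both partial products run on the CPU in this sketch; on a real platform
# A_gpu would be handed to a GPU SpMV kernel and the results concatenated.
y = np.concatenate([A_gpu @ x, A_cpu @ x])
assert np.allclose(y, A @ x)

A production scheme would balance the number of nonzeros (or a measured cost model) rather than raw row counts, which is part of what makes the partitioning problem studied above non-trivial.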
2

Narayana, Divyaprabha Kabbal, and Sudarshan Tekal Subramanyam Babu. "Optimal task partitioning to minimize failure in heterogeneous computational platform." International Journal of Electrical and Computer Engineering (IJECE) 15, no. 1 (2025): 1079. http://dx.doi.org/10.11591/ijece.v15i1.pp1079-1088.

Full text
Abstract:
The increased energy consumption by heterogeneous cloud platforms surges the carbon emissions and reduces system reliability, thus, making workload scheduling an extremely challenging process. The dynamic voltage- frequency scaling (DVFS) technique provides an efficient mechanism in improving the energy efficiency of cloud platform; however, employing DVFS reduces reliability and increases the failure rate of resource scheduling. Most of the current workload scheduling methods have failed to optimize the energy and reliability together under a central processing unit - graphical processing uni
APA, Harvard, Vancouver, ISO, and other styles
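The energy-reliability tension described in the abstract above can be made concrete with common textbook models rather than the authors' formulation: dynamic power scales roughly with the cube of the normalized frequency (voltage tracking frequency), while the transient-fault rate is often modelled as rising exponentially as frequency is lowered. All constants in the sketch below are illustrative assumptions.

# Illustrative DVFS trade-off using standard models (not this paper's method):
# energy ~ f^3 * time with time ~ 1/f, and fault rate growing as f drops.
def energy_and_fault_rate(f, cycles=1e9, lambda0=1e-6, d=3.0, f_min=0.4):
    exec_time = cycles / (f * 1e9)          # f is a normalized frequency in (0, 1]
    energy = (f ** 3) * exec_time           # normalized dynamic energy
    fault_rate = lambda0 * 10 ** (d * (1 - f) / (1 - f_min))
    return energy, fault_rate

for f in (1.0, 0.8, 0.6, 0.4):
    e, lam = energy_and_fault_rate(f)
    print(f"f={f:.1f}  energy={e:.3f}  fault_rate={lam:.2e}")

Lower frequencies save energy but raise the fault rate, which is why the scheduling methods surveyed above must optimize both objectives together.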
3

Yang, Huijing, and Tingwen Yu. "Two novel cache management mechanisms on CPU-GPU heterogeneous processors." Research Briefs on Information and Communication Technology Evolution 7 (June 15, 2021): 1–8. http://dx.doi.org/10.56801/rebicte.v7i.113.

Full text
Abstract:
Heterogeneous multicore processors that take full advantage of CPUs and GPUs within the same chip raise an emerging challenge for sharing a series of on-chip resources, particularly Last-Level Cache (LLC) resources. Since the GPU core has good parallelism and memory latency tolerance, the majority of the LLC space is utilized by GPU applications. Under the current cache management policies, the LLC sharing of CPU applications can be remarkably decreased due to the existence of GPU workloads, thus seriously affecting the overall performance. To alleviate the unfair contention within CPUs and GPUs for
APA, Harvard, Vancouver, ISO, and other styles
4

Narayana, Divyaprabha Kabbal, and Sudarshan Tekal Subramanyam Babu. "Optimal task partitioning to minimize failure in heterogeneous computational platform." International Journal of Electrical and Computer Engineering (IJECE) 15 (February 1, 2025): 1079–88. https://doi.org/10.11591/ijece.v15i1.pp1079-1088.

Full text
Abstract:
The increased energy consumption by heterogeneous cloud platforms surges the carbon emissions and reduces system reliability, thus, making workload scheduling an extremely challenging process. The dynamic voltage-frequency scaling (DVFS) technique provides an efficient mechanism in improving the energy efficiency of cloud platform; however, employing DVFS reduces reliability and increases the failure rate of resource scheduling. Most of the current workload scheduling methods have failed to optimize the energy and reliability together under a central processing un
APA, Harvard, Vancouver, ISO, and other styles
5

Fang, Juan, Mengxuan Wang, and Zelin Wei. "A memory scheduling strategy for eliminating memory access interference in heterogeneous system." Journal of Supercomputing 76, no. 4 (2020): 3129–54. http://dx.doi.org/10.1007/s11227-019-03135-7.

Full text
Abstract:
Multiple CPUs and GPUs are integrated on the same chip to share memory, and access requests between cores are interfering with each other. Memory requests from the GPU seriously interfere with the CPU memory access performance. Requests between multiple CPUs are intertwined when accessing memory, and its performance is greatly affected. The difference in access latency between GPU cores increases the average latency of memory accesses. In order to solve the problems encountered in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strateg
APA, Harvard, Vancouver, ISO, and other styles
6

Merrill, Duane, and Andrew Grimshaw. "High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing." Parallel Processing Letters 21, no. 02 (2011): 245–72. http://dx.doi.org/10.1142/s0129626411000187.

Full text
Abstract:
The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compar
APA, Harvard, Vancouver, ISO, and other styles
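For readers unfamiliar with the kernel studied above, least-significant-digit radix sorting reduces to repeated histogram, prefix-sum, and scatter passes; the paper's GPU algorithms parallelize exactly these passes. Below is a plain sequential Python sketch, not the authors' GPU implementation.

# Sequential LSD radix sort over 8-bit digits of non-negative integer keys.
def radix_sort(keys, key_bits=32, digit_bits=8):
    radix = 1 << digit_bits
    for shift in range(0, key_bits, digit_bits):
        counts = [0] * radix
        for k in keys:                          # histogram pass
            counts[(k >> shift) & (radix - 1)] += 1
        offsets, total = [0] * radix, 0
        for d in range(radix):                  # exclusive prefix sum
            offsets[d], total = total, total + counts[d]
        out = [0] * len(keys)
        for k in keys:                          # stable scatter pass
            d = (k >> shift) & (radix - 1)
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))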
7

Vilches, Antonio, Rafael Asenjo, Angeles Navarro, Francisco Corbera, Rubén Gran, and María Garzarán. "Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips." Procedia Computer Science 51 (2015): 140–49. http://dx.doi.org/10.1016/j.procs.2015.05.213.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Sung, Hanul, Hyeonsang Eom, and HeonYoung Yeom. "The Need of Cache Partitioning on Shared Cache of Integrated Graphics Processor between CPU and GPU." KIISE Transactions on Computing Practices 20, no. 9 (2014): 507–12. http://dx.doi.org/10.5626/ktcp.2014.20.9.507.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Wang, Shunjiang, Baoming Pu, Ming Li, Weichun Ge, Qianwei Liu, and Yujie Pei. "State Estimation Based on Ensemble DA–DSVM in Power System." International Journal of Software Engineering and Knowledge Engineering 29, no. 05 (2019): 653–69. http://dx.doi.org/10.1142/s0218194019400023.

Full text
Abstract:
This paper investigates the state estimation problem of power systems. A novel, fast and accurate state estimation algorithm is presented to solve this problem based on the one-dimensional denoising autoencoder and deep support vector machine (1D DA–DSVM). Besides, for further reducing the computation burden, a partitioning method is presented to divide the power system into several sub-networks and the proposed algorithm can be applied to each sub-network. A hybrid computing architecture of Central Processing Unit (CPU) and Graphics Processing Unit (GPU) is employed in the overall state estim
APA, Harvard, Vancouver, ISO, and other styles
10

Park, Sungwoo, Seyeon Oh, and Min-Soo Kim. "cuMatch: A GPU-based Memory-Efficient Worst-case Optimal Join Processing Method for Subgraph Queries with Complex Patterns." Proceedings of the ACM on Management of Data 3, no. 3 (2025): 1–28. https://doi.org/10.1145/3725398.

Full text
Abstract:
Subgraph queries are widely used but face significant challenges due to complex patterns such as negative and optional edges. While worst-case optimal joins have proven effective for subgraph queries with regular patterns, no method has been proposed that can process queries involving complex patterns in a single multi-way join. Existing CPU-based and GPU-based methods experience intermediate data explosion when processing complex patterns following regular patterns. In addition, GPU-based methods struggle with issues of wasted GPU memory and redundant computation. In this paper, we propose cu
APA, Harvard, Vancouver, ISO, and other styles
11

Barreiros, Willian, Alba C. M. A. Melo, Jun Kong, et al. "Efficient microscopy image analysis on CPU-GPU systems with cost-aware irregular data partitioning." Journal of Parallel and Distributed Computing 164 (June 2022): 40–54. http://dx.doi.org/10.1016/j.jpdc.2022.02.004.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Singh, Amit Kumar, Alok Prakash, Karunakar Reddy Basireddy, Geoff V. Merrett, and Bashir M. Al-Hashimi. "Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs." ACM Transactions on Embedded Computing Systems 16, no. 5s (2017): 1–22. http://dx.doi.org/10.1145/3126548.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Hou, Neng, Fazhi He, Yi Zhou, Yilin Chen, and Xiaohu Yan. "A Parallel Genetic Algorithm With Dispersion Correction for HW/SW Partitioning on Multi-Core CPU and Many-Core GPU." IEEE Access 6 (2018): 883–98. http://dx.doi.org/10.1109/access.2017.2776295.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Mahmud, Shohaib, Haiying Shen, and Anand Iyer. "PACER: Accelerating Distributed GNN Training Using Communication-Efficient Partition Refinement and Caching." Proceedings of the ACM on Networking 2, CoNEXT4 (2024): 1–18. http://dx.doi.org/10.1145/3697805.

Full text
Abstract:
Despite recent breakthroughs in distributed Graph Neural Network (GNN) training, large-scale graphs still generate significant network communication overhead, decreasing time and resource efficiency. Although recently proposed partitioning or caching methods try to reduce communication inefficiencies and overheads, they are not sufficiently effective due to their sampling pattern-agnostic nature. This paper proposes a Pipelined Partition Aware Caching and Communication Efficient Refinement System (Pacer), a communication-efficient distributed GNN training system. First, Pacer intelligently est
APA, Harvard, Vancouver, ISO, and other styles
15

Chen, Hao, Anqi Wei, and Ye Zhang. "Three-level parallel-set partitioning in hierarchical trees coding based on the collaborative CPU and GPU for remote sensing images compression." Journal of Applied Remote Sensing 11, no. 04 (2017): 1. http://dx.doi.org/10.1117/1.jrs.11.045015.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Wu, Qunyong, Yuhang Wang, Haoyu Sun, Han Lin, and Zhiyuan Zhao. "A System Coupled GIS and CFD for Atmospheric Pollution Dispersion Simulation in Urban Blocks." Atmosphere 14, no. 5 (2023): 832. http://dx.doi.org/10.3390/atmos14050832.

Full text
Abstract:
Atmospheric pollution is a critical issue in public health systems. The simulation of atmospheric pollution dispersion in urban blocks, using CFD, faces several challenges, including the complexity and inefficiency of existing CFD software, time-consuming construction of CFD urban block geometry, and limited visualization and analysis capabilities of simulation outputs. To address these challenges, we have developed a prototype system that couples 3DGIS and CFD for simulating, visualizing, and analyzing atmospheric pollution dispersion. Specifically, a parallel algorithm for coordinate transfo
APA, Harvard, Vancouver, ISO, and other styles
17

Giannoula, Christina, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. "SparseP." Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, no. 1 (2022): 1–49. http://dx.doi.org/10.1145/3508041.

Full text
Abstract:
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been
APA, Harvard, Vancouver, ISO, and other styles
18

Kumar, P. S. Jagadeesh, Tracy Lin Huan, and Yang Yung. "Computational Paradigm and Quantitative Optimization to Parallel Processing Performance of Still Image Compression." Circulation in Computer Science 2, no. 4 (2017): 11–17. http://dx.doi.org/10.22632/ccs-2017-252-02.

Full text
Abstract:
Fashionable and staggering evolution in inferring the parallel processing routine coupled with the necessity to amass and distribute huge magnitude of digital records especially still images has fetched an amount of confronts for researchers and other stakeholders. These disputes exorbitantly outlay and maneuvers the digital information among others, subsists the spotlight of the research civilization in topical days and encompasses the lead to the exploration of image compression methods that can accomplish exceptional outcomes. One of those practices is the parallel processing of a diversity
APA, Harvard, Vancouver, ISO, and other styles
19

Duan, Jiaang, Shiyou Qian, Hanwen Hu, Dingyu Yang, Jian Cao, and Guangtao Xue. "PipeCo: Pipelining Cold Start of Deep Learning Inference Services on Serverless Platforms." ACM SIGMETRICS Performance Evaluation Review 53, no. 1 (2025): 151–53. https://doi.org/10.1145/3744970.3727307.

Full text
Abstract:
The fusion of serverless computing and deep learning (DL) has led to serverless inference, offering a promising approach for developing and deploying scalable and cost-efficient deep learning inference services (DLISs). However, the challenge of cold start presents a significant obstacle for DLISs, where DL model size greatly impacts latency. Existing studies mitigate cold starts by extending keep-alive times, which unfortunately leads to decreased resource utilization efficiency. To address this issue, we introduce PipeCo, a system designed to alleviate DLIS cold start. The core concept of Pi
APA, Harvard, Vancouver, ISO, and other styles
20

Duan, Jiaang, Shiyou Qian, Hanwen Hu, Dingyu Yang, Jian Cao, and Guangtao Xue. "PipeCo: Pipelining Cold Start of Deep Learning Inference Services on Serverless Platforms." Proceedings of the ACM on Measurement and Analysis of Computing Systems 9, no. 2 (2025): 1–23. https://doi.org/10.1145/3727125.

Full text
Abstract:
The fusion of serverless computing and deep learning (DL) has led to serverless inference, offering a promising approach for developing and deploying scalable and cost-efficient deep learning inference services (DLISs). However, the challenge of cold start presents a significant obstacle for DLISs, where DL model size greatly impacts latency. Existing studies mitigate cold starts by extending keep-alive times, which unfortunately leads to decreased resource utilization efficiency. To address this issue, we introduce PipeCo, a system designed to alleviate DLIS cold start. The core concept of Pi
APA, Harvard, Vancouver, ISO, and other styles
21

Tanaka, Satoshi, Kyoko Hasegawa, Susumu Nakata, et al. "Grid-Independent Metropolis Sampling for Volume Visualization." International Journal of Modeling, Simulation, and Scientific Computing 01, no. 02 (2010): 199–218. http://dx.doi.org/10.1142/s1793962310000158.

Full text
Abstract:
We propose a method of sampling regular and irregular-grid volume data for visualization. The method is based on the Metropolis algorithm that is a type of Monte Carlo technique. Our method enables "importance sampling" of local regions of interest in the visualization by generating sample points intensively in regions where a user-specified transfer function takes the peak values. The generated sample-point distribution is independent of the grid structure of the given volume data. Therefore, our method is applicable to irregular grids as well as regular grids. We demonstrate the effectivenes
APA, Harvard, Vancouver, ISO, and other styles
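The sampling strategy described above can be sketched as a generic Metropolis accept/reject loop whose unnormalized target density is a user-specified transfer function applied to the scalar field, so samples concentrate where the transfer function peaks. The one-dimensional field, Gaussian proposal, and transfer function below are assumptions for illustration, not the authors' setup.

# Metropolis sampling that concentrates points where transfer(field(x)) peaks.
import math, random

def field(x):                    # assumed scalar field on [0, 1]
    return math.sin(6.0 * x) ** 2

def transfer(v):                 # assumed transfer function peaking near v = 0.8
    return math.exp(-((v - 0.8) ** 2) / 0.01)

def metropolis(n_samples, step=0.05):
    x, samples = random.random(), []
    for _ in range(n_samples):
        cand = min(max(x + random.gauss(0.0, step), 0.0), 1.0)
        # accept with probability min(1, transfer(field(cand)) / transfer(field(x)))
        if random.random() < transfer(field(cand)) / max(transfer(field(x)), 1e-300):
            x = cand
        samples.append(x)
    return samples

samples = metropolis(10_000)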
22

Gu, Yufeng, Arun Subramaniyan, Tim Dunn, et al. "GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis." Communications of the ACM 68, no. 05 (2025): 81–90. https://doi.org/10.1145/3712168.

Full text
Abstract:
Genomics is playing an important role in transforming healthcare. Genetic data, however, is being produced at a rate that far outpaces Moore's Law. Many efforts have been made to accelerate genomics kernels on modern commodity hardware, such as CPUs and GPUs, as well as custom accelerators (ASICs) for specific genomics kernels. While ASICs provide higher performance and energy efficiency than general-purpose hardware, they incur a high hardware-design cost. Moreover, to extract the best performance, ASICs tend to have significantly different architectures for different kernels. The divergence
APA, Harvard, Vancouver, ISO, and other styles
23

Bloch, Aurelien, Simone Casale-Brunet, and Marco Mattavelli. "Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms by Dynamic Network Execution." Journal of Low Power Electronics and Applications 12, no. 3 (2022): 36. http://dx.doi.org/10.3390/jlpea12030036.

Full text
Abstract:
The performance of programs executed on heterogeneous parallel platforms largely depends on the design choices regarding how to partition the processing on the various different processing units. In other words, it depends on the assumptions and parameters that define the partitioning, mapping, scheduling, and allocation of data exchanges among the various processing elements of the platform executing the program. The advantage of programs written in languages using the dataflow model of computation (MoC) is that executing the program with different configurations and parameter settings does n
APA, Harvard, Vancouver, ISO, and other styles
24

Gallet, Benoit, and Michael Gowanlock. "Heterogeneous CPU-GPU Epsilon Grid Joins: Static and Dynamic Work Partitioning Strategies." Data Science and Engineering, October 21, 2020. http://dx.doi.org/10.1007/s41019-020-00145-x.

Full text
Abstract:
Given two datasets (or tables) A and B and a search distance ε, the distance similarity join, denoted as A ⋉_ε B, finds the pairs of points (p_a, p_b), where p_a ∈ A and p_b ∈ B, and such that the distance between p_a and p_b is ≤ ε. If A = B, then the similarity join is equivalent to a similarity self-join, denoted as A ⋈_ε A. We propose in this paper Heterogeneous Epsilon Grid Joins (HEGJoin), a heterogeneous CPU-GPU distance
APA, Harvard, Vancouver, ISO, and other styles
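The grid-based idea that HEGJoin builds on can be illustrated with a small sequential sketch: bucket one dataset into cells of side ε so that each query point only needs to examine its 3×3 cell neighborhood. The 2-D points and Euclidean distance below are assumptions for the example; this is not the HEGJoin code.

# Grid-based epsilon join in 2-D: each point of A checks only the 3x3 block
# of eps-sized cells around it instead of every point of B.
from collections import defaultdict
from math import dist, floor

def epsilon_join(A, B, eps):
    grid = defaultdict(list)
    for q in B:
        grid[(floor(q[0] / eps), floor(q[1] / eps))].append(q)
    pairs = []
    for p in A:
        cx, cy = floor(p[0] / eps), floor(p[1] / eps)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in grid.get((cx + dx, cy + dy), ()):
                    if dist(p, q) <= eps:
                        pairs.append((p, q))
    return pairs

print(epsilon_join([(0.0, 0.0), (1.0, 1.0)], [(0.05, 0.02), (2.0, 2.0)], eps=0.1))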
25

Campos, Cristian, Rafael Asenjo, Javier Hormigo, and Angeles Navarro. "Leveraging SYCL for Heterogeneous cDTW Computation on CPU, GPU, and FPGA." Concurrency and Computation: Practice and Experience 37, no. 15-17 (2025). https://doi.org/10.1002/cpe.70142.

Full text
Abstract:
One of the most time‐consuming kernels of a recent epileptic seizure detection application is the computation of the constrained Dynamic Time Warping (cDTW) Distance Matrix. In this paper, we explore the design space of heterogeneous CPU, GPU, and FPGA implementations of this kernel using SYCL as a programming model. First, we optimize the CPU implementation leveraging the SIMD capability of SYCL and compare it with the latest C++26 SIMD library. Next, we tune the SYCL code to run on an on‐chip GPU, iGPU, as well as on a discrete NVIDIA GPU, dGPU. We also develop a SYCL implementation
APA, Harvard, Vancouver, ISO, and other styles
26

Lee, Wan Luan, Dian-Lun Lin, Shui Jiang, et al. "G-kway: Multilevel GPU-Accelerated k-way Graph Partitioner using Task Graph Parallelism." ACM Transactions on Design Automation of Electronic Systems, May 3, 2025. https://doi.org/10.1145/3734522.

Full text
Abstract:
Graph partitioning is important for the design of many CAD algorithms. However, as the graph size continues to grow, graph partitioning becomes increasingly time-consuming. Recent research has introduced parallel graph partitioners using either multi-core CPUs or GPUs. However, the speedup of existing CPU graph partitioners is typically limited to a few cores, while the performance of GPU-based solutions is algorithmically limited by available GPU memory. To overcome these challenges, we propose G-kway, an efficient multilevel GPU-accelerated k-way graph partitioner. G-kway introduces an effe
APA, Harvard, Vancouver, ISO, and other styles
27

Wu, Zhenlin, Haosong Zhao, Hongyuan Liu, Wujie Wen, and Jiajia Li. "gHyPart: GPU-friendly End-to-End Hypergraph Partitioner." ACM Transactions on Architecture and Code Optimization, January 10, 2025. https://doi.org/10.1145/3711925.

Full text
Abstract:
Hypergraph partitioning finds practical applications in various fields, such as high-performance computing and circuit partitioning in VLSI physical design, where high-performance solutions often demand substantial parallelism beyond what existing CPU-based solutions can offer. While GPUs are promising in this regard, their potential in hypergraph partitioning remains unexplored. In this work, we first develop an end-to-end deterministic hypergraph partitioner on GPUs, ported from state-of-the-art multi-threaded CPU work, and identify three major performance challenges by characterizing its pe
APA, Harvard, Vancouver, ISO, and other styles
28

"Improving Processing Speed of Real-Time Stereo Matching using Heterogenous CPU/GPU Model." International Journal of Innovative Technology and Exploring Engineering 9, no. 5 (2020): 1983–87. http://dx.doi.org/10.35940/ijitee.e2982.039520.

Full text
Abstract:
This paper presents an improvement of the processing speed of the stereo matching problem. The time required for stereo matching represents a problem for many real time applications such as robot navigation, self-driving vehicles and object tracking. In this work, a real-time stereo matching system is proposed that utilizes the parallelism of Graphics Processing Unit (GPU). An area based stereo matching system is used to generate the disparity map. Four different sequential and parallel computational models are used to analyze the time consumed by the stereo matching. The models are: 1) Seque
APA, Harvard, Vancouver, ISO, and other styles
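Area-based stereo matching, as used above, picks for each pixel the disparity that minimizes a window cost such as the sum of absolute differences (SAD). The sketch below is a slow sequential NumPy illustration with an assumed window size and disparity range; the paper's contribution is mapping this computation onto the GPU.

# Block-matching disparity map via SAD; window size and disparity range are
# assumptions, and the loops here are what a GPU version would parallelize.
import numpy as np

def disparity_map(left, right, max_disp=16, win=5):
    h, w = left.shape
    half = win // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int32)
            best, best_d = np.inf, 0
            for d in range(max_disp):
                cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                sad = np.abs(patch - cand.astype(np.int32)).sum()
                if sad < best:
                    best, best_d = sad, d
            disp[y, x] = best_d
    return disp

left = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
right = np.roll(left, -3, axis=1)            # synthetic horizontal shift of 3 px
print(disparity_map(left, right)[32, 40])    # expected to print 3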
29

Lin, Ning, and Venkata Dinavahi. "Parallel High-Fidelity Electromagnetic Transient Simulation of Large-Scale Multi-Terminal DC Grids." November 19, 2018. https://doi.org/10.5281/zenodo.7685832.

Full text
Abstract:
Electromagnetic transient (EMT) simulation of power electronics conducted on the CPU slows down as the system scales up. Thus, the massive parallelism of the graphics processing unit (GPU) is utilized to expedite the simulation of the multi-terminal DC (MTDC) grid, where detailed models of the semiconductor switches are adopted to provide comprehensive device-level information. As the large number of nodes leads to an inefficient solution of the DC grid, three levels of circuit partitioning are applied, i.e., the transmission line-based natural separation of converter stations, splitting of th
APA, Harvard, Vancouver, ISO, and other styles
30

Bloch, Aurelien, Simone Casale-Brunet, and Marco Mattavelli. "Design Space Exploration for Partitioning Dataflow Program on CPU-GPU Heterogeneous System." Journal of Signal Processing Systems, July 31, 2023. http://dx.doi.org/10.1007/s11265-023-01884-6.

Full text
Abstract:
Dataflow programming is a methodology that enables the development of high-level, parametric programs that are independent of the underlying platform. This approach is particularly useful for heterogeneous platforms, as it eliminates the need to rewrite application software for each configuration. Instead, it only requires new low-level implementation code, which is typically automatically generated through code generation tools. The performance of programs running on heterogeneous parallel platforms is highly dependent on the partitioning and mapping of computation to different proces
APA, Harvard, Vancouver, ISO, and other styles
31

Kemmler, Samuel, Christoph Rettinger, Ulrich Rüde, Pablo Cuéllar, and Harald Köstler. "Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures." International Journal of High Performance Computing Applications, January 10, 2025. https://doi.org/10.1177/10943420241313385.

Full text
Abstract:
Current supercomputers often have a heterogeneous architecture using both conventional Central Processing Units (CPUs) and Graphics Processing Units (GPUs). At the same time, numerical simulation tasks frequently involve multiphysics scenarios whose components run on different hardware due to multiple reasons, e.g., architectural requirements, pragmatism, etc. This leads naturally to a software design where different simulation modules are mapped to different subsystems of the heterogeneous architecture. We present a detailed performance analysis for such a hybrid four-way coupled simulation o
APA, Harvard, Vancouver, ISO, and other styles
32

Ali, Teymoor, Deepayan Bhowmik, and Robert Nicol. "Energy aware computer vision algorithm deployment on heterogeneous architectures." Discover Electronics 2, no. 1 (2025). https://doi.org/10.1007/s44291-025-00078-7.

Full text
Abstract:
Abstract Computer vision algorithms, specifically convolutional neural networks (CNNs) and feature extraction algorithms, have become increasingly pervasive in many vision tasks. As algorithm complexity grows, it raises computational and memory requirements, which poses a challenge to embedded vision systems with limited resources. Heterogeneous architectures have recently gained momentum as a new path forward for energy efficiency and faster computation, as they allow for the effective utilisation of various processing units, such as Central Processing Unit (CPU), Graphics Processing Unit (GP
APA, Harvard, Vancouver, ISO, and other styles
33

Shokrani Baigi, Ahmad, Abdorreza Savadi, and Mahmoud Naghibzadeh. "Optimizing sparse matrix partitioning in a heterogeneous CPU-GPU system for high-performance." Computing 107, no. 4 (2025). https://doi.org/10.1007/s00607-025-01456-5.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Lin, Ning, and Venkata Dinavahi. "Exact Nonlinear Micromodeling for Fine-Grained Parallel EMT Simulation of MTDC Grid Interaction With Wind Farm." August 1, 2019. https://doi.org/10.5281/zenodo.7683216.

Full text
Abstract:
Detailed high-order models of the insulated gate bipolar transistor (IGBT) and the diode are rarely included in power converters for large-scale system-level electromagnetic transient (EMT) simulation on the CPU, due to the nonlinear characteristics albeit they are more accurate. The massively parallel architecture of the graphics processing unit (GPU) enables a lower computational burden by avoiding the computation of  complex devices repetitively in a sequential manner and thus is utilized in this paper to simulate the wind farm-integrated multiterminal dc (MTdc) grid based on the modul
APA, Harvard, Vancouver, ISO, and other styles
35

Thomas, Beatrice, Roman Le Goff Latimier, Hamid Ben Ahmed, Gurvan Jodin, Abdelhafid El Ouardi, and Samir Bouaziz. "Optimized CPU-GPU Partitioning for an ADMM Algorithm Applied to a Peer-to-Peer Energy Market." SSRN Electronic Journal, 2022. http://dx.doi.org/10.2139/ssrn.4186889.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Mu, Yifei, Ce Yu, Chao Sun, et al. "3DT-CM: A Low-complexity Cross-matching Algorithm for Large Astronomical Catalogues using 3d-tree Approach." Research in Astronomy and Astrophysics, August 8, 2023. http://dx.doi.org/10.1088/1674-4527/acee50.

Full text
Abstract:
Abstract Location-based cross-matching is a preprocessing step in astronomy that aims to identify records belonging to the same celestial body based on the angular distance formula. The traditional approach involves comparing each record in one catalogue with every record in the other catalogue, resulting in a one-to-one comparison with high computational complexity. To reduce the computational time, index partitioning methods are used to divide the sky into regions and perform local cross-matching. In addition, cross-matching algorithms have been adopted on high-performance architectures to i
APA, Harvard, Vancouver, ISO, and other styles
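The angular distance formula mentioned above can be written as the haversine-based angular separation between two sky positions, followed by a match-radius test. The sketch assumes RA/Dec in degrees and a 1-arcsecond radius; both are illustrative choices, not values taken from the paper.

# Angular separation of two (RA, Dec) positions in degrees via the haversine
# formula, plus a simple cross-match test against an assumed radius.
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def is_match(rec1, rec2, radius_arcsec=1.0):
    return angular_sep_deg(*rec1, *rec2) * 3600.0 <= radius_arcsec

print(is_match((150.00000, 2.20000), (150.00010, 2.20005)))   # about 0.4 arcsec apart

Index-partitioning schemes such as the 3d-tree approach above exist precisely so that this pairwise test is evaluated only for nearby candidates rather than for every record pair.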
37

Magalhães, W. F., M. C. De Farias, H. M. Gomes, L. B. Marinho, G. S. Aguiar, and P. Silveira. "Evaluating Edge-Cloud Computing Trade-Offs for Mobile Object Detection and Classification with Deep Learning." Journal of Information and Data Management 11, no. 1 (2020). http://dx.doi.org/10.5753/jidm.2020.2026.

Full text
Abstract:
Internet-of-Things (IoT) applications based on Artificial Intelligence, such as mobile object detection and recognition from images and videos, may greatly benefit from inferences made by state-of-the-art Deep Neural Network(DNN) models. However, adopting such models in IoT applications poses an important challenge since DNNs usually require lots of computational resources (i.e. memory, disk, CPU/GPU, and power), which may prevent them to run on resource-limited edge devices. On the other hand, moving the heavy computation to the Cloud may significantly increase running costs and latency of Io
APA, Harvard, Vancouver, ISO, and other styles
38

Sahebi, Amin, Marco Barbone, Marco Procaccini, Wayne Luk, Georgi Gaydadjiev, and Roberto Giorgi. "Distributed large-scale graph processing on FPGAs." Journal of Big Data 10, no. 1 (2023). http://dx.doi.org/10.1186/s40537-023-00756-x.

Full text
Abstract:
Processing large-scale graphs is challenging due to the nature of the computation that causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGA). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited devic
APA, Harvard, Vancouver, ISO, and other styles
39

Schmidt, Bertil, Felix Kallenborn, Alejandro Chacon, and Christian Hundt. "CUDASW++4.0: ultra-fast GPU-based Smith–Waterman protein sequence database search." BMC Bioinformatics 25, no. 1 (2024). http://dx.doi.org/10.1186/s12859-024-05965-6.

Full text
Abstract:
Abstract Background The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations. Results CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GP
APA, Harvard, Vancouver, ISO, and other styles
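The kernel accelerated by CUDASW++4.0 is the Smith-Waterman recurrence; a compact sequential version with a linear gap penalty is shown below. The match/mismatch/gap scores and the linear gap model are simplifying assumptions, whereas the tool itself uses substitution matrices and affine gap penalties on CUDA-enabled GPUs.

# Smith-Waterman local alignment score, H[i][j] = max(0, diagonal, up, left).
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))

The quadratic number of cells in H is what makes database-scale search compute-intensive and worth offloading to GPUs.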
40

Yanamala, Rama Muni Reddy, and Muralidhar Pullakandam. "Empowering edge devices: FPGA‐based 16‐bit fixed‐point accelerator with SVD for CNN on 32‐bit memory‐limited systems." International Journal of Circuit Theory and Applications, February 13, 2024. http://dx.doi.org/10.1002/cta.3957.

Full text
Abstract:
Convolutional neural networks (CNNs) are now often used in deep learning and computer vision applications. Its convolutional layer accounts for most calculations and should be computed fast in a local edge device. Field‐programmable gate arrays (FPGAs) have been adequately explored as promising hardware accelerators for CNNs due to their high performance, energy efficiency, and reconfigurability. This paper developed an efficient FPGA‐based 16‐bit fixed‐point hardware accelerator unit for deep learning applications on the 32‐bit low‐memory edge device (PYNQ‐Z2 board). Additionally, sin
APA, Harvard, Vancouver, ISO, and other styles
41

Liu, Chaoqiang, Xiaofei Liao, Long Zheng, et al. "L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform." ACM Transactions on Reconfigurable Technology and Systems, March 14, 2024. http://dx.doi.org/10.1145/3652609.

Full text
Abstract:
Due to the high complexity of constructing exact k -nearest neighbor graphs, approximate construction has become a popular research topic. The NN-Descent algorithm is one of the representative in-memory algorithms. To effectively handle large datasets, existing state-of-the-art solutions combine the divide-and-conquer approach and the NN-Descent algorithm, where large datasets are divided into multiple partitions, and a subgraph is constructed for each partition before all the subgraphs are merged, reducing the memory pressure significantly. However, such solutions fail to address inefficienci
APA, Harvard, Vancouver, ISO, and other styles
42

Aghapour, Ehsan, Dolly Sapra, Andy Pimentel, and Anuj Pathania. "ARM-CO-UP: ARM COoperative Utilization of Processors." ACM Transactions on Design Automation of Electronic Systems, April 8, 2024. http://dx.doi.org/10.1145/3656472.

Full text
Abstract:
HMPSoCs combine different processors on a single chip. They enable powerful embedded devices, which increasingly perform ML inference tasks at the edge. State-of-the-art HMPSoCs can perform on-chip embedded inference using different processors, such as CPUs, GPUs, and NPUs. HMPSoCs can potentially overcome the limitation of low single-processor CNN inference performance and efficiency by cooperative use of multiple processors. However, standard inference frameworks for edge devices typically utilize only a single processor. We present the ARM-CO-UP framework built on the ARM-CL library. The AR
APA, Harvard, Vancouver, ISO, and other styles
43

Vera-Parra, Nelson Enrique, Danilo Alfonso López-Sarmiento, and Cristian Alejandro Rojas-Quintero. "Heterogeneous Computing to Accelerate the Search of Super K-mers Based on Minimizers." International Journal of Computing, December 30, 2020, 525–32. http://dx.doi.org/10.47839/ijc.19.4.1985.

Full text
Abstract:
The k-mers processing techniques based on partitioning of the data set on the disk using minimizer-type seeds have led to a significant reduction in memory requirements; however, it has added processes (search and distribution of super k-mers) that can be intensive given the large volume of data. This paper presents a massive parallel processing model in order to enable the efficient use of heterogeneous computation to accelerate the search of super k-mers based on seeds (minimizers or signatures). The model includes three main contributions: a new data structure called CISK for representing t
APA, Harvard, Vancouver, ISO, and other styles
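The minimizer-based partitioning mentioned above assigns each k-mer the lexicographically smallest m-mer it contains; consecutive k-mers sharing a minimizer form a super k-mer and are written to the same disk partition. The values of k and m and the plain lexicographic ordering in the sketch below are assumptions for illustration.

# Group the k-mers of a read by their minimizer (smallest contained m-mer).
def minimizer(kmer, m):
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def partition_kmers(read, k=9, m=4):
    buckets = {}
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        buckets.setdefault(minimizer(kmer, m), []).append(kmer)
    return buckets

for mz, kmers in partition_kmers("ACGTTGCATGTCGCATG").items():
    print(mz, kmers)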
44

Karp, Martin, Estela Suarez, Jan H. Meinke, et al. "Experience and analysis of scalable high-fidelity computational fluid dynamics on modular supercomputing architectures." International Journal of High Performance Computing Applications, November 28, 2024. http://dx.doi.org/10.1177/10943420241303163.

Full text
Abstract:
The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture at scale through domain partitioning, where the computational domain is split between a Booster modu
APA, Harvard, Vancouver, ISO, and other styles
45

Zhang, Yajie, Ce Yu, Chao Sun, et al. "HLC2: a Highly Efficient Cross-matching Framework for Large Astronomical Catalogues on Heterogeneous Computing Environments." Monthly Notices of the Royal Astronomical Society, January 10, 2023. http://dx.doi.org/10.1093/mnras/stad067.

Full text
Abstract:
Abstract Cross-matching operation, which is to find corresponding data for the same celestial object or region from multiple catalogues, is indispensable to astronomical data analysis and research. Due to the large amount of astronomical catalogues generated by the ongoing and next generation large-scale sky surveys, the time complexity of the cross-matching is increasing dramatically. Heterogeneous computing environments provide a theoretical possibility to accelerate the cross-matching, but the performance advantages of heterogeneous computing resources have not been fully utilized. To meet
APA, Harvard, Vancouver, ISO, and other styles
46

Lin, Ning, and Venkata Dinavahi. "Variable Time-Stepping Modular Multilevel Converter Model for Fast and Parallel Transient Simulation of Multiterminal DC Grid." September 1, 2019. https://doi.org/10.5281/zenodo.7685899.

Full text
Abstract:
The efficiency of multiterminal dc (MTDC) grid simulation decreases with an expansion of its scale and the inclusion of accurate component models. Thus, the variable time-stepping scheme is proposed in this paper to expedite the electromagnetic transient computation. A number of criteria are proposed to evaluate the time-step and regulate it dynamically during simulation. Meanwhile, as the accuracy of results is heavily reliant on the switch model in the modular multilevel converter, the nonlinear behavioral model with a greater accuracy is proposed in addition to the classic ideal model, and
APA, Harvard, Vancouver, ISO, and other styles