Journal articles on the topic "GPU-CPU"

Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles.

Consult the top 50 journal articles for your research on the topic "GPU-CPU".

Next to every source in the reference list there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Zhu, Ziyu, Xiaochun Tang, and Quan Zhao. "A unified schedule policy of distributed machine learning framework for CPU-GPU cluster." Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University 39, no. 3 (2021): 529–38. http://dx.doi.org/10.1051/jnwpu/20213930529.

Abstract
With the widespread use of GPU hardware, more and more distributed machine learning applications have begun to use CPU-GPU hybrid cluster resources to improve the efficiency of algorithms. However, the existing distributed machine learning scheduling frameworks either only consider task scheduling on CPU resources or only consider task scheduling on GPU resources. Even when the difference between CPU and GPU resources is considered, it is difficult to improve the resource usage of the entire system. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to efficiently schedule the tasks in a job. In the full paper, we propose a CPU-GPU hybrid cluster scheduling framework in detail. First, according to the different characteristics of CPU and GPU computing power, the data is divided into fragments of different sizes to match the CPU and GPU computing resources. Second, the paper introduces the task scheduling method for the CPU-GPU hybrid. Finally, the proposed method is verified at the end of the paper. In our evaluation with K-Means, using the CPU-GPU hybrid computing framework increases the performance of K-Means by about 1.5 times. As the number of GPUs increases, the performance of K-Means can be significantly improved.
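The fragment-sizing idea above lends itself to a short illustration. The following sketch (a hypothetical example, not the authors' framework; the split ratio, names and sizes are assumptions) divides the points of one K-Means assignment step between GPU and CPU in proportion to an assumed throughput ratio, launches the GPU part asynchronously and processes the CPU share in the meantime:

    #include <cuda_runtime.h>
    #include <cfloat>

    // Label each point in [first, first+count) with its nearest of k centroids (dim-dimensional).
    __global__ void assign_gpu(const float* pts, const float* ctr, int* label,
                               int first, int count, int k, int dim) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= count) return;
        const float* p = pts + (size_t)(first + i) * dim;
        float best = FLT_MAX; int bestC = 0;
        for (int c = 0; c < k; ++c) {
            float d = 0.f;
            for (int j = 0; j < dim; ++j) { float t = p[j] - ctr[c * dim + j]; d += t * t; }
            if (d < best) { best = d; bestC = c; }
        }
        label[first + i] = bestC;
    }

    // Same assignment on the CPU for the remaining fragment.
    void assign_cpu(const float* pts, const float* ctr, int* label,
                    int first, int count, int k, int dim) {
        for (int i = first; i < first + count; ++i) {
            float best = FLT_MAX; int bestC = 0;
            for (int c = 0; c < k; ++c) {
                float d = 0.f;
                for (int j = 0; j < dim; ++j) { float t = pts[(size_t)i * dim + j] - ctr[c * dim + j]; d += t * t; }
                if (d < best) { best = d; bestC = c; }
            }
            label[i] = bestC;
        }
    }

    // gpu_share (e.g. 0.8, measured offline) decides how large the GPU's data fragment is.
    void assign_hybrid(const float* d_pts, const float* d_ctr, int* d_label,   // device copies
                       const float* h_pts, const float* h_ctr, int* h_label,   // host copies
                       int n, int k, int dim, float gpu_share) {
        int nGpu = (int)(n * gpu_share);
        if (nGpu > 0)
            assign_gpu<<<(nGpu + 255) / 256, 256>>>(d_pts, d_ctr, d_label, 0, nGpu, k, dim);
        assign_cpu(h_pts, h_ctr, h_label, nGpu, n - nGpu, k, dim);   // overlaps with the running kernel
        cudaMemcpy(h_label, d_label, sizeof(int) * nGpu, cudaMemcpyDeviceToHost);  // waits for the kernel
    }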
2

Cui, Pengjie, Haotian Liu, Bo Tang, and Ye Yuan. "CGgraph: An Ultra-Fast Graph Processing System on Modern Commodity CPU-GPU Co-processor." Proceedings of the VLDB Endowment 17, no. 6 (2024): 1405–17. http://dx.doi.org/10.14778/3648160.3648179.

Abstract
In recent years, many CPU-GPU heterogeneous graph processing systems have been developed in both academia and industry to facilitate large-scale graph processing in various applications, e.g., social networks and biological networks. However, the performance of existing systems can be significantly improved by addressing two prevailing challenges: GPU memory over-subscription and efficient CPU-GPU cooperative processing. In this work, we propose CGgraph, an ultra-fast CPU-GPU graph processing system to address these challenges. In particular, CGgraph overcomes GPU memory over-subscription by extracting a subgraph which only needs to be loaded into GPU memory once, but whose vertices and edges can be used in multiple iterations during the graph processing procedure. To support efficient CPU-GPU co-processing, we design a CPU-GPU cooperative processing scheme, which balances the workloads between CPU and GPU by on-demand task allocation. To evaluate the efficiency of CGgraph, we conduct extensive experiments, comparing it with 7 state-of-the-art systems using 4 well-known graph algorithms on 6 real-world graphs. Our prototype system CGgraph outperforms all existing systems, delivering up to an order of magnitude improvement. Moreover, CGgraph on a modern commodity machine with a CPU-GPU co-processor yields superior (or at the very least, comparable) performance compared to existing systems on a high-end CPU-GPU server.
3

Lee, Taekhee, and Young J. Kim. "Massively parallel motion planning algorithms under uncertainty using POMDP." International Journal of Robotics Research 35, no. 8 (2015): 928–42. http://dx.doi.org/10.1177/0278364915594856.

Abstract
We present new parallel algorithms that solve continuous-state partially observable Markov decision process (POMDP) problems using the GPU (gPOMDP) and a hybrid of the GPU and CPU (hPOMDP). We choose the Monte Carlo value iteration (MCVI) method as our base algorithm and parallelize this algorithm using the multi-level parallel formulation of MCVI. For each parallel level, we propose efficient algorithms to utilize the massive data parallelism available on modern GPUs. Our GPU-based method uses two workload distribution techniques, compute/data interleaving and workload balancing, in order to obtain the maximum parallel performance at the highest level. We also present a CPU-GPU hybrid method that takes advantage of both CPU and GPU parallelism in order to solve highly complex POMDP planning problems. The CPU is responsible for data preparation, while the GPU performs Monte Carlo simulations; these operations are performed concurrently using the compute/data overlap technique between the CPU and GPU. To the best of the authors' knowledge, our algorithms are the first parallel algorithms that efficiently execute POMDP in a massively parallel fashion utilizing the GPU or a hybrid of the GPU and CPU. Our algorithms outperform the existing CPU-based algorithm by a factor of 75–99 based on the chosen benchmark.
4

Yogatama, Bobbi W., Weiwei Gong, and Xiangyao Yu. "Orchestrating data placement and query execution in heterogeneous CPU-GPU DBMS." Proceedings of the VLDB Endowment 15, no. 11 (2022): 2491–503. http://dx.doi.org/10.14778/3551793.3551809.

Abstract
There has been a growing interest in using GPUs to accelerate data analytics due to their massive parallelism and high memory bandwidth. The main constraint of using the GPU for data analytics is the limited capacity of GPU memory. Heterogeneous CPU-GPU query execution is a compelling approach to mitigate the limited GPU memory capacity and PCIe bandwidth. However, the design space of heterogeneous CPU-GPU query execution has not been fully explored. We aim to improve the state-of-the-art CPU-GPU data analytics engine by optimizing data placement and heterogeneous query execution. First, we introduce a semantic-aware fine-grained caching policy which takes into account various aspects of the workload such as query semantics, data correlation, and query frequency when determining data placement between CPU and GPU. Second, we introduce a heterogeneous query executor which can fully exploit data in both CPU and GPU and coordinate query execution at a fine granularity. We integrate both solutions in Mordred, our novel hybrid CPU-GPU data analytics engine. Evaluation on the Star Schema Benchmark shows that the semantic-aware caching policy can outperform the best traditional caching policy by up to 3x. Compared to existing GPU DBMSs, Mordred can outperform them by an order of magnitude.
5

Raju, K., and Niranjan N. Chiplunkar. "Performance Enhancement of CUDA Applications by Overlapping Data Transfer and Kernel Execution." Applied Computer Science 17, no. 3 (2021): 5–18. http://dx.doi.org/10.35784/acs-2021-17.

Abstract
The CPU-GPU combination is a widely used heterogeneous computing system in which the CPU and GPU have different address spaces. Since the GPU cannot directly access CPU memory, the input data must be available in GPU memory before the GPU function is invoked. On completion of the GPU function, the results of the computation are transferred to CPU memory. CPU-GPU data transfer happens over the PCI-Express bus. The PCI-E bandwidth is much lower than that of GPU memory, so the speed at which data is transferred is limited by the PCI-E bandwidth. Hence, the PCI-E acts as a performance bottleneck. In this paper two approaches are discussed to minimize the overhead of data transfer, namely, performing the data transfer while the GPU function is being executed and reducing the amount of data to be transferred to the GPU. The effect of these approaches on the execution time of a set of CUDA applications is evaluated using CUDA streams. The results of our experiments show that the execution time of applications can be reduced with the proposed approaches.
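The overlap technique discussed here can be sketched with a few lines of CUDA. The example below is a generic illustration (not code from the paper): it splits a buffer into chunks, uses pinned host memory, and issues copies and kernels on separate streams so that the transfer of one chunk overlaps the kernel of another.

    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int N = 1 << 24, CHUNKS = 4, CH = N / CHUNKS;
        float *h, *d;
        cudaMallocHost(&h, N * sizeof(float));   // pinned memory is required for truly async copies
        cudaMalloc(&d, N * sizeof(float));
        for (int i = 0; i < N; ++i) h[i] = 1.0f;

        cudaStream_t s[CHUNKS];
        for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);

        // While chunk c is being copied, the kernel for an earlier chunk can still be running.
        for (int c = 0; c < CHUNKS; ++c) {
            size_t off = (size_t)c * CH;
            cudaMemcpyAsync(d + off, h + off, CH * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            scale<<<(CH + 255) / 256, 256, 0, s[c]>>>(d + off, CH, 2.0f);
            cudaMemcpyAsync(h + off, d + off, CH * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();

        for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
        cudaFree(d); cudaFreeHost(h);
        return 0;
    }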
6

Power, Jason, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood. "gem5-gpu: A Heterogeneous CPU-GPU Simulator." IEEE Computer Architecture Letters 14, no. 1 (2015): 34–36. http://dx.doi.org/10.1109/lca.2014.2299539.

7

Abdusalomov, Saidmalikxon Mannop o`g`li. "CPU VA GPU FARQLARI." CENTRAL ASIAN JOURNAL OF EDUCATION AND INNOVATION 2, no. 5 (2023): 168–70. https://doi.org/10.5281/zenodo.7935842.

8

Liu, Gaogao, Wenbo Yang, Peng Li, et al. "MIMO Radar Parallel Simulation System Based on CPU/GPU Architecture." Sensors 22, no. 1 (2022): 396. http://dx.doi.org/10.3390/s22010396.

Abstract
The data volume and computational load of MIMO radar are huge; very high-speed computation is necessary for real-time processing. In this paper, we mainly study the time-division MIMO radar signal processing flow, propose an improved MIMO radar signal processing algorithm that raises the processing speed compared with previous algorithms, and, on this basis, propose a parallel simulation system for MIMO radar based on the CPU/GPU architecture. The outer layer of the framework is coarse-grained and accelerated with OpenMP on the CPU, and the inner layer of fine-grained data processing is accelerated on the GPU. Its performance is significantly faster than serial computation, and satisfactory acceleration has been achieved in the CPU/GPU architecture simulation. The experimental results show that the MIMO radar parallel simulation system with CPU/GPU architecture greatly improves on the computing power of the CPU-based method. Compared with the serial sequential CPU method, GPU simulation achieves a speedup of 130 times. In addition, the MIMO radar signal processing parallel simulation system based on the CPU/GPU architecture has a performance improvement of 13% compared to the GPU-only method.
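The two-level structure described above can be sketched as follows. This is an illustrative outline under assumed names and sizes, not the simulation system itself: OpenMP parallelises the coarse-grained outer loop over channels on the CPU, while each iteration pushes its fine-grained per-sample work through a GPU kernel on its own stream.

    #include <cuda_runtime.h>
    #include <omp.h>

    __global__ void pulse_compress(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];   // placeholder for the real per-sample processing
    }

    // Process nch channels of ns samples each: OpenMP threads on the CPU each own a stream
    // and drive the fine-grained per-sample work on the GPU.
    void process(const float* h_in, float* h_out, int nch, int ns) {
        #pragma omp parallel for
        for (int c = 0; c < nch; ++c) {
            cudaStream_t s;
            cudaStreamCreate(&s);
            float *d_in, *d_out;
            cudaMalloc(&d_in, ns * sizeof(float));
            cudaMalloc(&d_out, ns * sizeof(float));
            cudaMemcpyAsync(d_in, h_in + (size_t)c * ns, ns * sizeof(float),
                            cudaMemcpyHostToDevice, s);
            pulse_compress<<<(ns + 255) / 256, 256, 0, s>>>(d_in, d_out, ns);
            cudaMemcpyAsync(h_out + (size_t)c * ns, d_out, ns * sizeof(float),
                            cudaMemcpyDeviceToHost, s);
            cudaStreamSynchronize(s);
            cudaFree(d_in); cudaFree(d_out);
            cudaStreamDestroy(s);
        }
    }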
9

Zou, Yong Ning, Jue Wang, and Jian Wei Li. "Cutting Display of Industrial CT Volume Data Based on GPU." Advanced Materials Research 271-273 (July 2011): 1096–102. http://dx.doi.org/10.4028/www.scientific.net/amr.271-273.1096.

Abstract
The rapid development of graphics processing units (GPUs) in recent years in terms of performance and programmability has attracted the attention of those seeking to leverage alternative architectures for better performance than commodity CPUs can provide. This paper presents a new algorithm for the cutting display of computed tomography volume data on the GPU. We first introduce the programming model of the GPU and outline the implementation of techniques for oblique-plane cutting display of volume data on both the CPU and GPU. We compare the approaches and present performance results for both the CPU and GPU. The results show that the cutting-display image generated by the GPU algorithm is clear, and the frame rate on the GPU is 2–9 times that on the CPU.
10

Jiang, Ronglin, Shugang Jiang, Yu Zhang, Ying Xu, Lei Xu, and Dandan Zhang. "GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform." International Journal of Antennas and Propagation 2014 (2014): 1–8. http://dx.doi.org/10.1155/2014/321081.

Abstract
This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations with the parallelization methods of Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure-GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure-CPU, pure-GPU, and CPU + GPU tests. Relative to the pure-CPU calculations for the same problems, the speedup ratio achieved by the CPU + GPU calculations is around 14. Compared to the pure-GPU calculations for the same problems, the CPU + GPU calculations have a 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small; however, this code can enlarge the maximum problem size by 25% without reducing the performance of a traditional pure-GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with those of MoM. Results show that there is good agreement between them.
11

Yogatama, Bobbi, Weiwei Gong, and Xiangyao Yu. "Scaling your Hybrid CPU-GPU DBMS to Multiple GPUs." Proceedings of the VLDB Endowment 17, no. 13 (2024): 4709–22. https://doi.org/10.14778/3704965.3704977.

Abstract
GPU-accelerated databases have been gaining popularity in recent years due to their massive parallelism and high memory bandwidth. The limited GPU memory capacity, however, is still a major bottleneck for GPU databases. Existing approaches have attempted to address this limitation by using (1) a hybrid CPU-GPU DBMS or (2) a multi-GPU DBMS. We aim to improve prior solutions further by leveraging both hybrid CPU-GPU DBMS and multi-GPU DBMS at the same time. In particular, we explore the design space and optimize the data placement and query execution in a hybrid CPU and multi-GPU DBMS. To improve data placement, we introduce the cache-aware replication policy, which takes into account the cost of shuffle when replicating data and can coordinate both caching and replication decisions for the best performance. To improve query execution, we extend the existing hybrid CPU-GPU query execution strategy with distributed query processing techniques to support multiple GPUs. We build a system called Lancelot, a hybrid CPU and multi-GPU data analytics engine with all the optimizations integrated. Our evaluation shows that the cache-aware replication outperforms other policies by up to 2.5× and Lancelot outperforms existing GPU DBMSes by at least 2× on the Star Schema Benchmark and 12× on the TPC-H Benchmark.
12

Semenenko, Julija, Aliaksei Kolesau, Vadimas Starikovičius, Artūras Mackūnas, and Dmitrij Šešok. "Comparison of GPU and CPU Efficiency While Solving Heat Conduction Problems." Mokslas - Lietuvos ateitis 12 (November 24, 2020): 1–5. http://dx.doi.org/10.3846/mla.2020.13500.

Abstract
An overview of GPU usage for solving different engineering problems, a comparison between CPU and GPU computations, and an overview of the heat conduction problem are provided in this paper. The Jacobi iterative algorithm was implemented using Python, the TensorFlow GPU library and NVIDIA CUDA technology. Numerical experiments were conducted with 6 CPUs and 4 GPUs. The fastest GPU used completed the calculations 19 times faster than the slowest CPU. On average, the GPU was 9 to 11 times faster than the CPU. A significant relative speed-up in GPU calculations starts when the matrix contains at least 400² floating-point numbers.
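For reference, one sweep of the Jacobi iteration being benchmarked looks like this in plain CUDA (an illustrative sketch; the paper's implementation used Python with the TensorFlow GPU library):

    #include <cuda_runtime.h>

    // One Jacobi sweep for the steady-state heat equation on an n x n grid:
    // each interior point becomes the average of its four neighbours.
    __global__ void jacobi_step(const float* u, float* u_new, int n) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1 && j > 0 && j < n - 1) {
            u_new[i * n + j] = 0.25f * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                        u[i * n + j - 1]   + u[i * n + j + 1]);
        }
    }

    // Both buffers must carry the same boundary values, since the kernel never writes them.
    void solve(float* d_u, float* d_u_new, int n, int iters) {
        dim3 block(16, 16);
        dim3 grid((n + 15) / 16, (n + 15) / 16);
        for (int it = 0; it < iters; ++it) {
            jacobi_step<<<grid, block>>>(d_u, d_u_new, n);
            float* tmp = d_u; d_u = d_u_new; d_u_new = tmp;   // ping-pong buffers
        }
        cudaDeviceSynchronize();
    }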
13

Hu, Peng, Zixiong Zhao, Aofei Ji, et al. "A GPU-Accelerated and LTS-Based Finite Volume Shallow Water Model." Water 14, no. 6 (2022): 922. http://dx.doi.org/10.3390/w14060922.

Abstract
This paper presents a GPU (Graphics Processing Unit)-accelerated and LTS (Local Time Step)-based finite volume Shallow Water Model (SWM). The model performance is compared against five other model versions (single-CPU versions with/without LTS, multi-CPU versions with/without LTS, and a GPU version) by simulating three flow scenarios: an idealized dam-break flow; an experimental dam-break flow; and a field-scale scenario of tidal flows. Satisfactory agreement between simulation results and the available measured data/reference solutions (water level, flow velocity) indicates that all six SWM versions can simulate these challenging shallow water flows well. Inter-comparisons of the computational efficiency of the six SWM versions indicate the following. First, GPU acceleration is much more efficient than multi-core CPU parallel computing: the speed-up on the GPU can be as high as one hundred, whereas that of the multi-core CPU is only 2–3. Second, implementing the LTS brings considerable further reduction: the additional maximum speed-ups can be as high as 10 for the single-core/multi-core CPU versions, and as high as 5 for the GPU versions. Third, the GPU + LTS version is computationally the most efficient in most cases; the multi-core CPU + LTS version may run as fast as a GPU version for scenarios over some intermediate number of cells.
14

Ai, Xin, Qiange Wang, Chunyu Cao, et al. "NeutronOrch: Rethinking Sample-Based GNN Training under CPU-GPU Heterogeneous Environments." Proceedings of the VLDB Endowment 17, no. 8 (2024): 1995–2008. http://dx.doi.org/10.14778/3659437.3659453.

Abstract
Graph Neural Networks (GNNs) have shown exceptional performance across a wide range of applications. Current frameworks leverage CPU-GPU heterogeneous environments for GNN model training, incorporating mini-batch and sampling techniques to mitigate GPU memory constraints. In such settings, sample-based GNN training can be divided into three phases: sampling, gathering, and training. Existing GNN systems deploy various task orchestration methods to execute each phase on either the CPU or GPU. However, through comprehensive experimentation and analysis, we observe that these task orchestration approaches do not optimally exploit the available heterogeneous resources, hindered by either inefficient CPU processing or GPU resource bottlenecks. In this paper, we propose NeutronOrch, a system for sample-based GNN training that ensures balanced utilization of the CPU and GPU. NeutronOrch decouples the training process by layer and pushes down the training task of the bottom layer to the CPU. This significantly reduces the computational load and memory footprint of GPU training. To avoid inefficient CPU processing, NeutronOrch only offloads the training of frequently accessed vertices to the CPU and lets GPU reuse their embeddings with bounded staleness. Furthermore, NeutronOrch provides a fine-grained pipeline design for the layer-based task orchestrating method. The experimental results show that compared with the state-of-the-art GNN systems, NeutronOrch can achieve up to 11.51× performance speedup.
15

Gyurjyan, Vardan, and Sebastian Mancilla. "Heterogeneous data-processing optimization with CLARA’s adaptive workflow orchestrator." EPJ Web of Conferences 245 (2020): 05020. http://dx.doi.org/10.1051/epjconf/202024505020.

Abstract
The hardware landscape used in HEP and NP is changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with their own characteristics. To achieve maximum performance in data processing, the main challenge is to place the right computation on the right hardware. In this paper, we discuss CLAS12 charged-particle tracking workflow orchestration that allows us to utilize both the CPU and GPU to improve performance. The tracking application algorithm was decomposed into micro-services that are deployed on CPU and GPU processing units, where the best features of both are intelligently combined to achieve maximum performance. In this heterogeneous environment, CLARA aims to match the requirements of each micro-service to the strengths of a CPU or a GPU architecture. A predefined execution of a micro-service on a CPU or a GPU may not be the most optimal solution due to the streaming data-quantum size and the data-quantum transfer latency between CPU and GPU. So, the CLARA workflow orchestrator is designed to dynamically assign micro-service execution to a CPU or a GPU, based on online benchmark results analyzed over a period of real-time data processing.
16

Agibalov, Oleg, and Nikolay Ventsov. "On the issue of fuzzy timing estimations of the algorithms running at GPU and CPU architectures." E3S Web of Conferences 135 (2019): 01082. http://dx.doi.org/10.1051/e3sconf/201913501082.

Abstract
We consider the task of comparing fuzzy estimates of the execution parameters of genetic algorithms implemented on GPU (graphics processing unit) and CPU (central processing unit) architectures. Fuzzy estimates are calculated based on the averaged dependences of the genetic algorithms' running time on GPU and CPU architectures on the number of individuals in the populations processed by the algorithm. The analysis of these averaged dependences showed that a genetic algorithm can process 10,000 chromosomes on the GPU architecture or 5,000 chromosomes on the CPU architecture in approximately 2,500 ms. The following holds for the cases under consideration: "Genetic algorithms (GA) are performed in approximately 2,500 ms (on average)," and the α-sections of the fuzzy sets, with α = 0.5, correspond to the intervals [2000, 2399] for the GA executed on the GPU architecture and [1400, 1799] for the GA executed on the CPU architecture. Thereby, it can be said that in this case the actual execution time of the algorithm on the GPU architecture deviates from the average value to a lesser extent than on the CPU.
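For readers unfamiliar with the notation, the α-sections (α-cuts) referred to above are the standard construction: for a fuzzy running-time estimate with membership function μ,

    \tilde{T}_{\alpha} = \{\, t : \mu_{\tilde{T}}(t) \ge \alpha \,\}

so the quoted intervals [2000, 2399] ms and [1400, 1799] ms are the sets of running times whose membership degree is at least 0.5 for the GPU and CPU estimates, respectively.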
17

Fortin, Pierre, and Maxime Touche. "Dual tree traversal on integrated GPUs for astrophysical N-body simulations." International Journal of High Performance Computing Applications 33, no. 5 (2019): 960–72. http://dx.doi.org/10.1177/1094342019840806.

Abstract
In astrophysical N-body simulations, O(N) fast multipole methods (FMMs) with dual tree traversal (DTT) on multi-core CPUs are faster than O(N log N) CPU tree-codes but can still be outperformed by GPU ones. In this article, we aim at combining the best algorithm, namely FMM with DTT, with the most powerful hardware currently available, namely GPUs. In the astrophysical context requiring low accuracies and non-uniform particle distributions, we show that such a combination can be achieved thanks to a hybrid CPU-GPU algorithm on integrated GPUs: while the DTT is performed on the CPU cores, the far- and near-field computations are all performed on the GPU cores. We show how to efficiently expose the interactions resulting from the DTT to the GPU cores, how to deploy both the far- and near-field computations on GPU, and how to overlap the parallel DTT on CPU with GPU computations. Based on the falcON code and using OpenCL on AMD Accelerated Processing Units and on Intel integrated GPUs, this first heterogeneous deployment of DTT for FMM outperforms standard multi-core CPUs and matches GPU and high-end CPU performance, being hence more cost- and power-efficient.
18

Liu, Changyuan. "Study on the Particle Sorting Performance for Reactor Monte Carlo Neutron Transport on Apple Unified Memory GPUs." EPJ Web of Conferences 302 (2024): 04001. http://dx.doi.org/10.1051/epjconf/202430204001.

Abstract
In the simulation of nuclear reactor physics using the Monte Carlo neutron transport method on GPUs, the sorting of particles plays a significant role in the performance of the calculation. Traditionally, CPUs and GPUs are separate devices connected at a low data transfer rate and with high data transfer latency. Emerging computing chips tend to integrate CPUs and GPUs; one example is the Apple silicon chips with unified memory. Such unified-memory chips have opened doors for new strategies of collaboration between CPUs and GPUs for Monte Carlo neutron transport. Sorting particles on the CPU and transporting them on the GPU is an example of such a new strategy, which previously suffered from the high CPU-GPU data transfer latency of traditional devices with separate CPU and GPU. The finding is that for the Apple M2 Max and M3 Max chips, sorting on the CPU leads to better performance per unit of power than sorting on the GPU for the ExaSMR whole-core benchmark problems and the HTR-10 high-temperature gas reactor fuel pebble problem. The partially sorted particle order has been identified as contributing to the higher performance of the CPU sort compared with the GPU sort. The in-house code using both CPU and GPU achieves 7.6 times (M3 Max) the power efficiency of OpenMC on the CPU for the ExaSMR whole-core benchmark with depleted fuel, and 130 times (M3 Max) for the HTR-10 fuel pebble benchmark with depleted fuel.
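The sort-on-CPU/transport-on-GPU collaboration enabled by unified memory can be sketched as below. This is a generic illustration using CUDA managed memory as a stand-in (the paper targets Apple silicon, not CUDA); the particle layout and the transport step are placeholders.

    #include <cuda_runtime.h>
    #include <algorithm>

    struct Particle { int cell; float energy; };

    __global__ void transport(Particle* p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].energy *= 0.99f;   // placeholder for the per-particle transport step
    }

    int main() {
        const int N = 1 << 20;
        Particle* p;
        cudaMallocManaged(&p, N * sizeof(Particle));   // one allocation visible to both CPU and GPU
        for (int i = 0; i < N; ++i) {
            p[i].cell = (int)((i * 2654435761u) % 4096u);   // scrambled cell index for the demo
            p[i].energy = 1.0f;
        }

        // CPU sorts by cell so that neighbouring GPU threads work on the same cell's data.
        std::sort(p, p + N, [](const Particle& a, const Particle& b) { return a.cell < b.cell; });

        transport<<<(N + 255) / 256, 256>>>(p, N);
        cudaDeviceSynchronize();   // required before the host reads the results again
        cudaFree(p);
        return 0;
    }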
19

Cao, Wei, Zheng Hua Wang, and Chuan Fu Xu. "An Out-of-Core Method for CFD Simulation in Heterogeneous Environment." Advanced Materials Research 753-755 (August 2013): 2912–15. http://dx.doi.org/10.4028/www.scientific.net/amr.753-755.2912.

Abstract
In recent years, the highly parallel graphics processing unit (GPU) has rapidly been gaining maturity as a powerful engine for high-performance computing. However, in most computational fluid dynamics (CFD) simulations, the computational capacity of the CPU is ignored. In this paper, we propose a hybrid parallel programming model to utilize the computational capacity of both CPU and GPU. Considering the memory capacities of the CPU and GPU, we also propose an out-of-core method to increase the simulation scale on a single node. The experimental results show that the programming model can utilize the computational capacity of both CPU and GPU efficiently and that the out-of-core method can increase the simulation scale on a single node.
20

Yang, Min Kyu, and Jae-Seung Jeong. "Optimized Hybrid Central Processing Unit–Graphics Processing Unit Workflow for Accelerating Advanced Encryption Standard Encryption: Performance Evaluation and Computational Modeling." Applied Sciences 15, no. 7 (2025): 3863. https://doi.org/10.3390/app15073863.

Abstract
This study addresses the growing demand for scalable data encryption by evaluating the performance of AES (Advanced Encryption Standard) encryption and decryption using CBC (Cipher Block Chaining) and CTR (Counter Mode) modes across various CPU (Central Processing Unit) and GPU (Graphics Processing Unit) hardware models. The objective is to highlight GPU acceleration benefits and propose an optimized hybrid CPU–GPU workflow for large-scale data security. Methods include benchmarking encryption performance with provided data, mathematical models, and computational analysis. The results indicate significant performance gains with GPU acceleration, particularly for large datasets, and demonstrate that the hybrid CPU–GPU approach balances speed and resource utilization efficiently.
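CTR mode is the GPU-friendly case because every 16-byte keystream block is computed independently from a nonce and a block counter. The sketch below shows only that parallel structure; the cipher core is a trivial placeholder rather than real AES, and the kernel is a generic illustration, not the workflow evaluated in the paper.

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Placeholder "block cipher": XORs the counter block with the key.
    // A real implementation would run the 10/12/14 AES rounds here instead.
    __device__ void toy_encrypt_block(const uint8_t key[16], const uint8_t in[16], uint8_t out[16]) {
        for (int i = 0; i < 16; ++i) out[i] = in[i] ^ key[i];
    }

    // CTR mode: keystream block j = E(key, nonce || j); ciphertext = plaintext XOR keystream.
    __global__ void ctr_encrypt(const uint8_t* key, uint64_t nonce,
                                const uint8_t* pt, uint8_t* ct, uint64_t nblocks) {
        uint64_t j = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
        if (j >= nblocks) return;

        uint8_t ctr[16], ks[16];
        for (int i = 0; i < 8; ++i) ctr[i]     = (uint8_t)(nonce >> (8 * i));   // nonce half
        for (int i = 0; i < 8; ++i) ctr[8 + i] = (uint8_t)(j >> (8 * i));       // counter half

        toy_encrypt_block(key, ctr, ks);
        for (int i = 0; i < 16; ++i)
            ct[j * 16 + i] = pt[j * 16 + i] ^ ks[i];   // one independent block per thread
    }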
21

Shim, Hyungwook, Myeongju Ko, and Minho Seo. "Decomposition analysis of influencing factors of GPU-centric supercomputing demand: LMDI-based approach." Edelweiss Applied Science and Technology 9, no. 2 (2025): 208–17. https://doi.org/10.55214/25768484.v9i2.4455.

Abstract
With the introduction of AI technology, the supercomputing industry is transitioning from CPU-centric to GPU-centric, and many countries are making efforts to build new GPU-centric resources. The purpose of this paper is to discover new factors in demand management for efficient construction and operation of future national supercomputing GPU resources. Reflecting industry characteristics, we decompose the factors affecting existing CPU use into intensity effect, structure effect, and production effect indicators targeting CPU-only resources and GPU-only resources, and compare and analyze the influence of each factor. To estimate the influence of each factor, the Logarithmic Mean Divisia Index methodology was used, and annual CPU usage data from the Republic of Korea's national supercomputing center was used. As a result of the analysis, it was confirmed that CPU resources show a similar trend every year, and that the effects of the intensity and production indicators are continuously increasing. In the case of GPU resources, all indicators had an influence in the direction of increasing demand, and it was confirmed that the information/communication field was overwhelmingly showing the greatest effect.
22

Abuda, Chad Ferrino, and Tae Young Choe. "Efficient Deep Learning Job Allocation in Cloud Systems by Predicting Resource Consumptions including GPU and CPU." Tehnički glasnik 19, no. 3 (2025): 461–72. https://doi.org/10.31803/tg-20240112104444.

Abstract
One objective of GPU scheduling in cloud systems is to minimize the completion times of given deep learning models. This is important for deep learning in cloud environments because deep learning workloads require a lot of time to finish, and misallocation of these workloads can cause a huge increase in job completion time. The difficulty of GPU scheduling comes from a diverse set of parameters, including model architectures and GPU types. Some model architectures are CPU-intensive rather than GPU-intensive, which creates different hardware requirements when training different models. Previous GPU scheduling research used a small set of parameters that did not include CPU parameters, which made it difficult to reduce the job completion time (JCT). This paper introduces an improved GPU scheduling approach that reduces job completion time by predicting execution time and various resource consumption parameters, including GPU utilization (%), GPU memory utilization (%), GPU memory, and CPU utilization (%). The experimental results show that the proposed model improves JCT by up to 40.9% for GPU allocation based on computing efficiency compared to Driple.
23

Tang, Wenjie, Wentong Cai, Yiping Yao, Xiao Song, and Feng Zhu. "An alternative approach for collaborative simulation execution on a CPU+GPU hybrid system." SIMULATION 96, no. 3 (2019): 347–61. http://dx.doi.org/10.1177/0037549719885178.

Abstract
In the past few years, the graphics processing unit (GPU) has been widely used to accelerate time-consuming models in simulations. Since both model computation and simulation management are main factors that affect the performance of large-scale simulations, only accelerating model computation will limit the potential speedup. Moreover, models that can be well accelerated by a GPU could be insufficient, especially for simulations with many lightweight models. Traditionally, the parallel discrete event simulation (PDES) method is used to solve this class of simulation, but most PDES simulators only utilize the central processing unit (CPU) even though the GPU is commonly available now. Hence, we propose an alternative approach for collaborative simulation execution on a CPU+GPU hybrid system. The GPU supports both simulation management and model computation as CPUs. A concurrency-oriented scheduling algorithm was proposed to enable cooperation between the CPU and the GPU, so that multiple computation and communication resources can be efficiently utilized. In addition, GPU functions have also been carefully designed to adapt the algorithm. The combination of those efforts allows the proposed approach to achieve significant speedup compared to the traditional PDES on a CPU.
24

Hadi, N. A., S. A. Halim, N. S. M. Lazim, and N. Alias. "Performance of CPU GPU Parallel Architecture on Segmentation and Geometrical Features Extraction of Malaysian Herb Leaves." Malaysian Journal of Mathematical Sciences 16, no. 2 (2022): 363–77. http://dx.doi.org/10.47836/mjms.16.2.12.

Abstract
Image recognition involves segmentation of the image boundary, geometrical feature extraction, and classification, which are used in the development of a particular image database. The ultimate challenge in this task is that it is computationally expensive. This paper presents a CPU-GPU architecture for the image segmentation and feature extraction processes of 125 images of Malaysian herb leaves. Two GPUs and three kernels are utilized in the CPU-GPU platform using MATLAB software. Each herb image has pixel dimensions of 1616 × 1080. The segmentation process uses the Sobel operator, which is then used to extract the boundary points. Finally, seven geometrical features are extracted for each image. Both processes are first executed on the CPU alone before being brought onto the CPU-GPU platform to accelerate the computational performance. The results show that the developed CPU-GPU platform accelerates the computation process by a factor of 4.13. However, the efficiency shows a decline, which suggests that processor utilization must be improved in the future to balance the load distribution.
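The Sobel step that dominates the segmentation stage parallelises naturally, one GPU thread per pixel. A minimal CUDA sketch of the gradient-magnitude computation (illustrative only; the paper used MATLAB's CPU-GPU facilities):

    #include <cuda_runtime.h>
    #include <math.h>

    // Sobel gradient magnitude for an h x w single-channel image (border pixels left untouched).
    __global__ void sobel(const unsigned char* in, float* mag, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;

        int i = y * w + x;
        float gx = -in[i - w - 1] - 2.f * in[i - 1] - in[i + w - 1]
                 +  in[i - w + 1] + 2.f * in[i + 1] + in[i + w + 1];
        float gy = -in[i - w - 1] - 2.f * in[i - w] - in[i - w + 1]
                 +  in[i + w - 1] + 2.f * in[i + w] + in[i + w + 1];
        mag[i] = sqrtf(gx * gx + gy * gy);
    }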
25

Chen, Lin, Deshi Ye, and Guochuan Zhang. "Online Scheduling of Mixed CPU-GPU Jobs." International Journal of Foundations of Computer Science 25, no. 06 (2014): 745–61. http://dx.doi.org/10.1142/s0129054114500312.

Abstract
We consider the online scheduling problem in a CPU-GPU cluster. In this problem there are two sets of processors, the CPU processors and the GPU processors. Each job has two distinct processing times, one for the CPU processor and the other for the GPU processor. Once a job is released, a decision should be made immediately about which processor it should be assigned to. The goal is to minimize the makespan, i.e., the largest completion time among all the processors. Such a problem can be seen as an intermediate model between the scheduling problems on identical machines and on unrelated machines. We provide a 3.85-competitive online algorithm for this problem and show that no online algorithm exists with a competitive ratio strictly less than 2. We also consider two special cases of this problem: the balanced case, where the number of CPU processors equals that of GPU processors, and the one-sided case, where there is only one CPU or GPU processor. For the balanced case, we first provide a simple 3-competitive algorithm, and then a better algorithm with a competitive ratio of 2.732 is derived. For the one-sided case, a 3-competitive algorithm is given.
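A useful mental baseline for this problem is greedy list scheduling: when a job arrives, place it on the machine whose completion time grows the least, using the job's CPU time on CPU machines and its GPU time on GPU machines. The host-side sketch below implements that baseline only; it is not the 3.85-competitive algorithm from the paper, which uses a more careful assignment rule.

    #include <vector>
    #include <algorithm>

    struct Job { double cpuTime, gpuTime; };   // the two distinct processing times

    // loads[0 .. m_cpu-1] are CPU processors, the rest are GPU processors.
    // Returns the index of the machine the job is assigned to.
    int assignGreedy(std::vector<double>& loads, int m_cpu, const Job& j) {
        int best = 0;
        double bestFinish = 1e300;
        for (int i = 0; i < (int)loads.size(); ++i) {
            double t = (i < m_cpu) ? j.cpuTime : j.gpuTime;
            if (loads[i] + t < bestFinish) { bestFinish = loads[i] + t; best = i; }
        }
        loads[best] += (best < m_cpu) ? j.cpuTime : j.gpuTime;
        return best;
    }

    // The objective: the largest completion time over all processors.
    double makespan(const std::vector<double>& loads) {
        return *std::max_element(loads.begin(), loads.end());
    }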
26

Liu, Zhi Yuan, and Xue Zhang Zhao. "Research and Implementation of Image Rotation Based on CUDA." Advanced Materials Research 216 (March 2011): 708–12. http://dx.doi.org/10.4028/www.scientific.net/amr.216.708.

Abstract
GPU technology releases the CPU from burdensome graphics computing tasks. NVIDIA, the main GPU producer, adds CUDA technology to new GPU models, which enhances GPU functionality greatly and offers a large advantage in complex matrix computation. General algorithms for image rotation and the structure of CUDA are introduced in this paper. An example of rotating an image using HALCON, based on CPU instruction extensions and on CUDA technology, is used to demonstrate the advantage of CUDA by comparing the two results.
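Image rotation maps to CUDA with one thread per output pixel and inverse mapping into the source image. A minimal nearest-neighbour sketch (illustrative; not the HALCON-based code compared in the paper):

    #include <cuda_runtime.h>
    #include <math.h>

    // Rotate an h x w grayscale image by `angle` radians around its centre (nearest neighbour).
    __global__ void rotate(const unsigned char* src, unsigned char* dst,
                           int w, int h, float angle) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float cx = w * 0.5f, cy = h * 0.5f;
        float c = cosf(angle), s = sinf(angle);
        // Inverse mapping: where in the source does this destination pixel come from?
        float sx =  c * (x - cx) + s * (y - cy) + cx;
        float sy = -s * (x - cx) + c * (y - cy) + cy;

        int ix = (int)roundf(sx), iy = (int)roundf(sy);
        dst[y * w + x] = (ix >= 0 && ix < w && iy >= 0 && iy < h) ? src[iy * w + ix] : 0;
    }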
27

Tao, Yu-Bo, Hai Lin, and Hu Jun Bao. "From CPU to GPU: GPU-Based Electromagnetic Computing (GPUECO)." Progress In Electromagnetics Research 81 (2008): 1–19. http://dx.doi.org/10.2528/pier07121302.

28

Ma, Haifeng. "Development of a CPU-GPU heterogeneous platform based on a nonlinear parallel algorithm." Nonlinear Engineering 11, no. 1 (2022): 215–22. http://dx.doi.org/10.1515/nleng-2022-0027.

Abstract
In order to find a refined model-analysis software platform that balances both computational accuracy and computational efficiency, a CPU-GPU heterogeneous platform based on a nonlinear parallel algorithm is developed. A modular design method is adopted to complete the architecture of the structural nonlinear analysis software and clarify the basic analysis steps of nonlinear finite element problems, so as to determine the structure of the software system, divide it into modules, and clarify the function, interface, and call relationships of each module. The results show that when the number of model layers is 10, the GPU takes 210.5 s and the CPU 1073.2 s, so the computational time of the GPU is significantly better, with an acceleration ratio of 5.1. For all the models, the GPU calculation time is much less than that of the CPU, and when the number of model degrees of freedom increases, the acceleration effect of the GPU becomes more obvious. Therefore, the CPU-GPU heterogeneous platform can more accurately describe the nonlinear behavior of shear walls in complex stress states, and is computationally efficient.
29

Silva, Bruno, Luiz Guerreiro Lopes, and Fábio Mendonça. "Multithreaded and GPU-Based Implementations of a Modified Particle Swarm Optimization Algorithm with Application to Solving Large-Scale Systems of Nonlinear Equations." Electronics 14, no. 3 (2025): 584. https://doi.org/10.3390/electronics14030584.

Abstract
This paper presents a novel Graphics Processing Unit (GPU) accelerated implementation of a modified Particle Swarm Optimization (PSO) algorithm specifically designed to solve large-scale Systems of Nonlinear Equations (SNEs). The proposed GPU-based parallel version of the PSO algorithm uses the inherent parallelism of modern hardware architectures. Its performance is compared against both sequential and multithreaded Central Processing Unit (CPU) implementations. The primary objective is to evaluate the efficiency and scalability of PSO across different hardware platforms with a focus on solving large-scale SNEs involving thousands of equations and variables. The GPU-parallelized and multithreaded versions of the algorithm were implemented in the Julia programming language. Performance analyses were conducted on an NVIDIA A100 GPU and an AMD EPYC 7643 CPU. The tests utilized a set of challenging, scalable SNEs with dimensions ranging from 1000 to 5000. Results demonstrate that the GPU accelerated modified PSO substantially outperforms its CPU counterparts, achieving substantial speedups and consistently surpassing the highly optimized multithreaded CPU implementation in terms of computation time and scalability as the problem size increases. Therefore, this work evaluates the trade-offs between different hardware platforms and underscores the potential of GPU-based parallelism for accelerating SNE solvers.
30

Woźniak, Jarosław. "Wykorzystanie CPU i GPU do obliczeń w Matlabie." Journal of Computer Sciences Institute 10 (March 30, 2019): 32–35. http://dx.doi.org/10.35784/jcsi.191.

Abstract
This article presents selected approaches that use CPUs and GPUs for computations in the Matlab environment. Different methods of performing computations on the CPU and on the GPU were compared. The differences, drawbacks, advantages, and consequences of using the selected computation methods are pointed out.
31

Janiak, Adam, Wladyslaw Janiak, and Maciej Lichtenstein. "Tabu Search on GPU." JUCS - Journal of Universal Computer Science 14, no. 14 (2008): 2416–27. https://doi.org/10.3217/jucs-014-14-2416.

Abstract
Nowadays personal computers (PCs) are often equipped with powerful, multi-core CPUs. However, the processing power of a modern PC does not depend only on the processing power of the CPU and can be increased by proper use of GPGPU, i.e., general-purpose computation using graphics hardware. Modern graphics hardware, initially developed for computer graphics generation, has turned out to be flexible enough for general-purpose computations. In this paper we present the implementation of two optimization algorithms based on the tabu search technique, namely for the traveling salesman problem and the flow shop scheduling problem. Both algorithms are implemented in two versions that utilize, respectively, a multi-core CPU and a GPU. The extensive numerical experiments confirm the high computational power of the GPU and show that a tabu search algorithm run on a modern GPU can be up to 16 times faster than one run on a modern CPU.
32

Yoo, Seohwan, Sunjun Hwang, Hayeon Park, Jin Choi, and Chang-Gun Lee. "Hardware Interrupt-Aware CPU/GPU Scheduling on Heterogeneous Multicore and GPU System." KIISE Transactions on Computing Practices 29, no. 1 (2023): 10–14. http://dx.doi.org/10.5626/ktcp.2022.29.1.10.

33

Bhardwaj, Ayush, and B. Ramesh K. "Designing a Graphics Processing Unit with advanced Arithmetic Logic Unit Resulting Improved Performance." Research and Applications: Emerging Technologies 6, no. 3 (2024): 38–46. https://doi.org/10.5281/zenodo.12720907.

Abstract
This paper explores microprocessor intricacies, particularly the central processing unit (CPU) and the graphics processing unit (GPU). The CPU, dubbed a computer's brain, features critical components like the Control Unit (CU), Arithmetic Logic Unit (ALU), and Memory Unit (MU), orchestrating instruction execution and system resource management. GPUs, initially intended for graphics rendering, now excel in parallel processing, aiding tasks beyond graphics. The paper compares CPU and GPU architectures, emphasizing their parallel processing and memory hierarchies. The stages of the graphics rendering pipeline are delineated, illustrating the conversion of a 3D scene into a 2D image. GPU performance optimization methods, notably pipeline instructions, are discussed for significant performance enhancements, offering higher throughput, reduced latency, and improved efficiency. Through empirical evidence, it concludes that pipeline instructions significantly boost GPU performance, advancing computing capabilities. This research illuminates the pivotal role of pipeline instructions in GPU performance enhancement, driving modern computing advancement.
34

Wang, Qihan, Zhen Peng, Bin Ren, Jie Chen, and Robert G. Edwards. "MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation." ACM Transactions on Architecture and Code Optimization 19, no. 2 (2022): 1–26. http://dx.doi.org/10.1145/3506705.

Abstract
The many-body correlation function is a fundamental computation kernel in modern physics computing applications, e.g., hadron contractions in lattice quantum chromodynamics (QCD). This kernel is both computation- and memory-intensive, involving a series of tensor contractions, and thus usually runs on accelerators like GPUs. Existing optimizations of many-body correlation mainly focus on individual tensor contractions (e.g., cuBLAS libraries and others). In contrast, this work discovers a new optimization dimension for many-body correlation by exploring the optimization opportunities among tensor contractions. More specifically, it targets general GPU architectures (both NVIDIA and AMD) and optimizes many-body correlation's memory management by exploiting a set of memory allocation and communication redundancy elimination opportunities: first, GPU memory allocation redundancy: the intermediate output frequently occurs as input in the subsequent calculations; second, CPU-GPU communication redundancy: although all tensors are allocated on both CPU and GPU, many of them are used (and reused) on the GPU side only, and thus many CPU/GPU communications (like those in existing Unified Memory designs) are unnecessary; third, GPU oversubscription: limited GPU memory size causes oversubscription issues, and existing memory management usually results in near-reuse data eviction, thus incurring extra CPU/GPU memory communications. Targeting these memory optimization opportunities, this article proposes MemHC, an optimized systematic GPU memory management framework that aims to accelerate the calculation of many-body correlation functions utilizing a series of new memory-reduction designs. These designs involve optimizations for GPU memory allocation, CPU/GPU memory movement, and GPU memory oversubscription, respectively. More specifically, first, MemHC employs duplication-aware management and lazy release of GPU memory to the corresponding host management for better data reusability. Second, it implements data reorganization and on-demand synchronization to eliminate redundant (or unnecessary) data transfers. Third, MemHC exploits an optimized Least Recently Used (LRU) eviction policy called Pre-Protected LRU to reduce evictions and leverage memory hits. Additionally, MemHC is portable to various platforms including NVIDIA GPUs and AMD GPUs. The evaluation demonstrates that MemHC outperforms unified memory management by 2.18× to 10.73×. The proposed Pre-Protected LRU policy outperforms the original LRU policy by up to 1.36×.
35

Borcovas, Evaldas, and Gintautas Daunys. "CPU and GPU (CUDA) Template Matching Comparison / CPU ir GPU (CUDA) palyginimas vykdant šablonų atitikties algoritmą." Mokslas – Lietuvos ateitis 6, no. 2 (2014): 129–33. http://dx.doi.org/10.3846/mla.2014.16.

Abstract
Image processing, computer vision and other complicated optical information processing algorithms require large resources. It is often desired to execute algorithms in real time, and it is hard to fulfil such requirements with a single CPU. The CUDA technology proposed by NVIDIA enables the programmer to use the GPU resources in the computer. The current research was made with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB RAM DDR3 (CPU I), an NVIDIA GeForce GT320M CUDA-compatible graphics card (GPU I), an Intel Core i5-2500K 3.3 GHz processor with 4 GB RAM DDR3 (CPU II), and an NVIDIA GeForce GTX 560 CUDA-compatible graphics card (GPU II). The additional libraries OpenCV 2.1 and the CUDA-compatible OpenCV 2.4.0 were used for the testing. The main tests were made with the standard function MatchTemplate from the OpenCV libraries. The algorithm uses a main image and a template. The influence of these factors was tested: the main image and the template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the information obtained from the research, GPU computing using the hardware mentioned earlier is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and GPU is not significantly different. The choice of the template size influences the CPU calculations. The difference in computing time between the GPUs can be explained by the number of cores they have.
36

Paul, Indrani, Vignesh Ravi, Srilatha Manne, Manish Arora, and Sudhakar Yalamanchili. "Coordinated Energy Management in Heterogeneous Processors." Scientific Programming 22, no. 2 (2014): 93–108. http://dx.doi.org/10.1155/2014/210762.

Abstract
This paper examines energy management in a heterogeneous processor consisting of an integrated CPU–GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types – a new and less understood problem. We examine the intra-node CPU–GPU frequency sensitivity of HPC applications on tightly coupled CPU–GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU–GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves the measured average energy-delay-squared (ED²) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
37

Wang, Zhe, Yao Shen, and Zhou Lei. "EGA: An Efficient GPU Accelerated Groupby Aggregation Algorithm." Applied Sciences 15, no. 7 (2025): 3693. https://doi.org/10.3390/app15073693.

Abstract
With the exponential growth of big data, efficient groupby aggregation (GA) has become critical for real-time analytics across industries. GA is a key method for extracting valuable information. Current CPU-based solutions (such as large-scale parallel processing platforms) face computational throughput limitations. Since CPU-based platforms struggle to support real-time big data analysis, the GPU is introduced to support real-time GA analysis. Most GPU GA algorithms are based on hashing methods, and these algorithms experience performance degradation when the load factor of the hash table is too high or when the data volume exceeds the GPU memory capacity. This paper proposes an efficient hash-based GPU-accelerated groupby aggregation algorithm (EGA) that addresses these limitations. EGA features different designs for different scenarios: single-pass EGA (SP-EGA) maintains high efficiency when the data fit in GPU memory, while multi-pass EGA (MP-EGA) supports GA for data exceeding the GPU memory capacity. EGA demonstrates significant acceleration: SP-EGA outperforms SOTA hash-based GPU algorithms by 1.16–5.39× at load factors >0.90 and surpasses SOTA sort-based GPU methods by 1.30–2.48×. MP-EGA achieves a 6.45–29.12× speedup over SOTA CPU implementations.
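The core of a single-pass hash-based GA kernel is a table of (key, aggregate) slots updated with atomics: a thread claims a slot for its key with a compare-and-swap and then accumulates into it. The sketch below shows that pattern for a SUM aggregate (a generic illustration, not EGA itself, which adds the high-load-factor and out-of-memory handling described above):

    #include <cuda_runtime.h>

    #define EMPTY 0xFFFFFFFFu   // reserved key marking an unused slot

    // Open-addressing hash table: keys[] and sums[] both have cap slots (cap is a power of two).
    // Host side: initialise keys[] to EMPTY and sums[] to 0 before launching.
    __global__ void groupby_sum(const unsigned* key, const float* val, int n,
                                unsigned* keys, float* sums, unsigned cap) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned k = key[i];
        unsigned slot = (k * 2654435761u) & (cap - 1);          // multiplicative hash
        for (;;) {
            unsigned prev = atomicCAS(&keys[slot], EMPTY, k);   // try to claim the slot
            if (prev == EMPTY || prev == k) {                   // claimed it, or key already there
                atomicAdd(&sums[slot], val[i]);
                return;
            }
            slot = (slot + 1) & (cap - 1);                      // linear probing on collision
        }
    }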
38

Campeanu, Gabriel, and Mehrdad Saadatmand. "A Two-Layer Component-Based Allocation for Embedded Systems with GPUs." Designs 3, no. 1 (2019): 6. http://dx.doi.org/10.3390/designs3010006.

Abstract
Component-based development is a software engineering paradigm that can facilitate the construction of embedded systems and tackle their complexity. Modern embedded systems have more and more demanding requirements. One way to cope with such a versatile and growing set of requirements is to employ heterogeneous processing power, i.e., CPU–GPU architectures. The new CPU–GPU embedded boards deliver increased performance but also introduce additional complexity and challenges. In this work, we address the component-to-hardware allocation for CPU–GPU embedded systems. The allocation for such systems is much more complex due to the increased amount of GPU-related information. For example, while in traditional embedded systems the allocation mechanism may consider only the CPU memory usage of components to find an appropriate allocation scheme, in heterogeneous systems the GPU memory usage also needs to be taken into account in the allocation process. This paper aims at decreasing the component-to-hardware allocation complexity by introducing a two-layer component-based architecture for heterogeneous embedded systems. The detailed CPU–GPU information of the system is abstracted at a high layer by compacting connected components into single units that behave as regular components. The allocator, based on the compacted information received from the high-level layer, computes feasible allocation schemes with decreased complexity. In the last part of the paper, the two-layer allocation method is evaluated using an existing embedded system demonstrator, namely an underwater robot.
39

Handa, Pooja, Meenu Kalra, and Rajesh Sachdeva. "A Survey on Green Computing using GPU in Image Processing." International Journal of Computers & Technology 14, no. 10 (2015): 6135–41. http://dx.doi.org/10.24297/ijct.v14i10.1834.

Abstract
Green computing is the process of reducing the power consumed by a computer and thereby reducing carbon emissions. The total power consumed by the computer (excluding the monitor) at its full computational load is equal to the sum of the power consumed by the GPU in its idle state and the CPU at its full state. Recently, there has been tremendous interest in the acceleration of general computing applications using a graphics processing unit (GPU). The GPU now provides computing power not only for fast processing of graphics applications, but also for general, computationally complex, data-intensive applications. On the other hand, power and energy consumption are also becoming important design criteria, so software designs have to consider power/energy consumption together with performance during development. The GPU can therefore take over 100% of the CPU's work while the CPU stays in its idle state, and the power consumed by the GPU is low. When the GPU is doing all the work, the CPU remains at a load less than its idle load, so the total power consumed equals the power consumed by the CPU at a load below its idle load plus the power consumed by the GPU.
40

Ding, Li, Zhaomiao Dong, Huagang He, and Qibin Zheng. "A Hybrid GPU and CPU Parallel Computing Method to Accelerate Millimeter-Wave Imaging." Electronics 12, no. 4 (2023): 840. http://dx.doi.org/10.3390/electronics12040840.

Texto completo
Resumen
The range migration algorithm (RMA) based on Fourier transformation is widely applied in millimeter-wave (MMW) close-range imaging because of its few operations and small approximation. However, its interpolation stage is not effective due to the involved intensive logic controls, which limits the speed performance in a graphics processing unit (GPU) platform. Therefore, in this paper, we present an acceleration optimization method based on the hybrid GPU and central processing unit (CPU) parallel computation for implementing the RMA. The proposed method exploits the strong logic-control capability of the CPU to assist the GPU in processing the logic controls of the interpolation stage. The common positions of wavenumber-domain components to be interpolated are calculated by the CPU and stored in the constant memory for broadcast at any time. This avoids the repetitive computation consumed in a GPU-only scheme. Then the GPU is responsible for the remaining matrix-related steps and outputs the needed wavenumber-domain values. The imaging experiments verify the acceleration efficiency of the proposed method and demonstrate that the speedup ratio of our proposed method is more than 15 times of that by the CPU-only method, and more than 2 times of that by the GPU-only method.
41

GARBA, MICHAEL T., and HORACIO GONZÁLEZ–VÉLEZ. "ASYMPTOTIC PEAK UTILISATION IN HETEROGENEOUS PARALLEL CPU/GPU PIPELINES: A DECENTRALISED QUEUE MONITORING STRATEGY." Parallel Processing Letters 22, no. 02 (2012): 1240008. http://dx.doi.org/10.1142/s0129626412400087.

Full text
Abstract
Widespread heterogeneous parallelism is unavoidable given the emergence of General-Purpose computing on graphics processing units (GPGPU). The characteristics of a Graphics Processing Unit (GPU)—including significant memory transfer latency and complex performance characteristics—demand new approaches to ensuring that all available computational resources are efficiently utilised. This paper considers the simple case of a divisible workload based on widely-used numerical linear algebra routines and the challenges that prevent efficient use of all resources available to a naive SPMD application using the GPU as an accelerator. We suggest a possible queue monitoring strategy that facilitates resource usage with a view to balancing the CPU/GPU utilisation for applications that fit the pipeline parallel architectural pattern on heterogeneous multicore/multi-node CPU and GPU systems. We propose a stochastic allocation technique that may serve as a foundation for heuristic approaches to balancing CPU/GPU workloads.
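As a rough illustration of such a stochastic allocation technique (the update rule and names below are assumptions, not the authors' scheme), a dispatcher can route each divisible work block to the CPU or the GPU with a probability proportional to the throughput observed on each device, nudging utilisation toward balance as measurements accumulate.

```cpp
#include <cstdio>
#include <random>

// Toy stochastic dispatcher: each incoming block goes to the CPU or the GPU
// with probability proportional to the throughput observed so far on each side.
struct Dispatcher {
    double cpuRate = 1.0;   // blocks per second, refined from measurements
    double gpuRate = 4.0;
    std::mt19937 rng{42};

    bool sendToGpu() {
        std::bernoulli_distribution pick(gpuRate / (cpuRate + gpuRate));
        return pick(rng);
    }
    // Called when a block finishes; exponential smoothing of the rate estimate.
    void update(bool wasGpu, double measuredRate) {
        double& r = wasGpu ? gpuRate : cpuRate;
        r = 0.9 * r + 0.1 * measuredRate;
    }
};

int main() {
    Dispatcher d;
    int gpuBlocks = 0, total = 1000;
    for (int i = 0; i < total; ++i)
        if (d.sendToGpu()) ++gpuBlocks;
    std::printf("GPU received %d of %d blocks (~%.0f%%)\n",
                gpuBlocks, total, 100.0 * gpuBlocks / total);
    return 0;
}
```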
42

Chen, Yong, Hai Jin, Han Jiang, Dechao Xu, Ran Zheng, and Haocheng Liu. "Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems." Mobile Information Systems 2017 (2017): 1–10. http://dx.doi.org/10.1155/2017/1897476.

Full text
Abstract
Static state security analysis (SSSA) is one of the most important computations to check whether a power system is in a normal and secure operating state. It is a challenge to satisfy real-time requirements with CPU-based concurrent methods due to the intensive computations. A sensitivity analysis-based method using a graphics processing unit (GPU) is proposed for power systems, which can reduce calculation time by 40% compared to execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, a multifrontal method for sparse LU factorization is explored on the GPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on the GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations achieves a significant reduction in computation time.
43

Ngo, Long Thanh, Dzung Dinh Nguyen, Long The Pham, and Cuong Manh Luong. "Speedup of Interval Type 2 Fuzzy Logic Systems Based on GPU for Robot Navigation." Advances in Fuzzy Systems 2012 (2012): 1–11. http://dx.doi.org/10.1155/2012/698062.

Full text
Abstract
As the number of rules and the sample rate of type-2 fuzzy logic systems (T2FLSs) increase, the speed of calculation becomes a problem. The T2FLS has a large amount of inherent algorithmic parallelism that modern CPU architectures do not exploit. In the T2FLS, many rules and computations can be sped up on a graphics processing unit (GPU) as long as the majority of the computations at the various stages and components do not depend on each other. This paper demonstrates how to implement interval type-2 fuzzy logic systems (IT2-FLSs) on the GPU, with experiments on obstacle-avoidance behaviour for robot navigation. GPU-based calculation is a high-performance solution that also frees up the CPU. The experimental results show that the performance of the GPU is many times faster than that of the CPU.
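The parallelism exploited here can be pictured with a small CUDA sketch (illustrative membership functions and names, not the paper's implementation): each GPU thread evaluates the lower and upper firing strength of one rule, so all rules of the IT2-FLS fire concurrently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative interval type-2 Gaussian membership function: fixed mean and an
// uncertain standard deviation [sigLo, sigHi] give lower/upper membership grades.
struct IT2Gauss { float mean, sigLo, sigHi; };

__device__ void grades(IT2Gauss mf, float x, float& lo, float& hi) {
    float d = x - mf.mean;
    lo = expf(-0.5f * d * d / (mf.sigLo * mf.sigLo));  // narrower sigma -> lower grade
    hi = expf(-0.5f * d * d / (mf.sigHi * mf.sigHi));  // wider sigma   -> upper grade
}

// One thread per rule: the firing interval is the min t-norm over two antecedents.
__global__ void fireRules(const IT2Gauss* a1, const IT2Gauss* a2, float x1, float x2,
                          float* fLo, float* fHi, int nRules) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRules) return;
    float l1, h1, l2, h2;
    grades(a1[r], x1, l1, h1);
    grades(a2[r], x2, l2, h2);
    fLo[r] = fminf(l1, l2);
    fHi[r] = fminf(h1, h2);
}

int main() {
    const int nRules = 1024;                       // illustrative rule-base size
    IT2Gauss *d_a1, *d_a2; float *d_lo, *d_hi;
    cudaMalloc(&d_a1, nRules * sizeof(IT2Gauss));
    cudaMalloc(&d_a2, nRules * sizeof(IT2Gauss));
    cudaMalloc(&d_lo, nRules * sizeof(float));
    cudaMalloc(&d_hi, nRules * sizeof(float));

    // Fill antecedents host-side with simple placeholder parameters.
    IT2Gauss* h = new IT2Gauss[nRules];
    for (int r = 0; r < nRules; ++r) h[r] = {0.1f * (r % 10), 0.5f, 1.0f};
    cudaMemcpy(d_a1, h, nRules * sizeof(IT2Gauss), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a2, h, nRules * sizeof(IT2Gauss), cudaMemcpyHostToDevice);

    fireRules<<<(nRules + 127) / 128, 128>>>(d_a1, d_a2, 0.3f, 0.7f, d_lo, d_hi, nRules);
    cudaDeviceSynchronize();
    std::printf("rule firing: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_a1); cudaFree(d_a2); cudaFree(d_lo); cudaFree(d_hi); delete[] h;
    return 0;
}
```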
44

Echeverribar, Isabel, Mario Morales-Hernández, Pilar Brufau, and Pilar García-Navarro. "Analysis of the performance of a hybrid CPU/GPU 1D2D coupled model for real flood cases." Journal of Hydroinformatics 22, no. 5 (2020): 1198–216. http://dx.doi.org/10.2166/hydro.2020.032.

Full text
Abstract
Coupled 1D2D models emerged as an efficient solution for a two-dimensional (2D) representation of the floodplain combined with a fast one-dimensional (1D) schematization of the main channel. At the same time, high-performance computing (HPC) has appeared as an efficient tool for model acceleration. In this work, a previously validated 1D2D Central Processing Unit (CPU) model is combined with an HPC technique for fast and accurate flood simulation. Due to the speed of 1D schemes, a hybrid CPU/GPU model that runs the 1D main channel on CPU and accelerates the 2D floodplain with a Graphics Processing Unit (GPU) is presented. Since the data transfer between sub-domains and devices (CPU/GPU) may be the main potential drawback of this architecture, the test cases are selected to carry out a careful time analysis. The results reveal the speed-up dependency on the 2D mesh, the event to be solved and the 1D discretization of the main channel. Additionally, special attention must be paid to the time step size computation shared between sub-models. In spite of the use of a hybrid CPU/GPU implementation, high speed-ups are accomplished in some cases.
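The coupling described above can be outlined with a toy driver loop (hypothetical names and placeholder physics, assuming a minimum-over-submodels time-step rule): the 1D channel advances on the CPU, the 2D floodplain kernel advances on the GPU, and both sub-models share the most restrictive time step at each iteration.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative 2D floodplain update: one thread per cell (placeholder physics).
__global__ void step2D(float* h, int nCells, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nCells) h[i] += 0.0f * dt;   // stand-in for the real finite-volume update
}

// Illustrative 1D channel update on the CPU (placeholder physics).
static void step1D(float* q, int nSections, float dt) {
    for (int i = 0; i < nSections; ++i) q[i] += 0.0f * dt;
}

int main() {
    const int nCells = 1 << 20, nSections = 4096;
    float* h_q = new float[nSections]();
    float* d_h;
    cudaMalloc(&d_h, nCells * sizeof(float));
    cudaMemset(d_h, 0, nCells * sizeof(float));

    float t = 0.0f, tEnd = 10.0f;
    while (t < tEnd) {
        // Each sub-model proposes a stable step; the coupled step is the minimum.
        float dt1D = 0.05f;                  // would come from the 1D CFL condition
        float dt2D = 0.02f;                  // would come from a GPU reduction over cells
        float dt   = std::min(dt1D, dt2D);

        step1D(h_q, nSections, dt);                               // CPU: main channel
        step2D<<<(nCells + 255) / 256, 256>>>(d_h, nCells, dt);   // GPU: floodplain
        cudaDeviceSynchronize();

        // A real coupled model would also exchange boundary fluxes between the
        // 1D and 2D sub-domains here (a host<->device copy per iteration).
        t += dt;
    }
    std::printf("simulated %.2f s of flood time\n", t);
    cudaFree(d_h); delete[] h_q;
    return 0;
}
```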
45

Min, Seung Won, Kun Wu, Sitao Huang, et al. "Large graph convolutional network training with GPU-oriented data communication architecture." Proceedings of the VLDB Endowment 14, no. 11 (2021): 2087–100. http://dx.doi.org/10.14778/3476249.3476264.

Full text
Abstract
Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training GCN requires the minibatch generator traversing graphs and sampling the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU. This is because the CPU needs to (1) read sparse features from memory, (2) write features into memory as a dense format, and (3) transfer the features from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory through zero-copy accesses without much CPU help. By removing the CPU gathering stage, our method significantly reduces the consumption of the host resources and data access latency. We further present two important techniques to achieve high host memory access efficiency by the GPU: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness using several graphs with sizes up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65--92% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for some graphs that fit in GPU memory.
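A stripped-down CUDA sketch of the zero-copy idea (not the authors' PyTorch integration; names are illustrative): the feature table stays in pinned, mapped host memory, and GPU threads gather the sampled rows directly, with consecutive threads reading consecutive elements so host accesses coalesce into efficient PCIe transfers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread block copies one sampled node's feature row straight from host
// memory; consecutive threads read consecutive floats for coalesced accesses.
__global__ void gatherFeatures(const float* __restrict__ hostFeatures,  // zero-copy pointer
                               const int* __restrict__ nodeIds,
                               float* __restrict__ out, int featDim) {
    int row = blockIdx.x;
    int src = nodeIds[row];
    for (int j = threadIdx.x; j < featDim; j += blockDim.x)
        out[row * featDim + j] = hostFeatures[(size_t)src * featDim + j];
}

int main() {
    const int numNodes = 100000, featDim = 256, batch = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Feature table lives in pinned, mapped host memory (zero-copy accessible).
    float* h_features;
    cudaHostAlloc(&h_features, (size_t)numNodes * featDim * sizeof(float),
                  cudaHostAllocMapped);
    float* d_featuresView;
    cudaHostGetDevicePointer(&d_featuresView, h_features, 0);

    int* d_ids; float* d_out;
    cudaMalloc(&d_ids, batch * sizeof(int));
    cudaMalloc(&d_out, (size_t)batch * featDim * sizeof(float));
    // In a real minibatch generator, d_ids would hold the sampled neighbour IDs.
    cudaMemset(d_ids, 0, batch * sizeof(int));

    gatherFeatures<<<batch, 128>>>(d_featuresView, d_ids, d_out, featDim);
    cudaDeviceSynchronize();
    std::printf("gather: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_ids); cudaFree(d_out); cudaFreeHost(h_features);
    return 0;
}
```

In this arrangement the CPU never packs a dense feature buffer; the gather kernel pulls only the rows it needs over PCIe, which is the pressure-relief effect the abstract describes.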
46

Lee, Chien Yu, H. S. Lin, and H. T. Yau. "Using Graphic Hardware to Accelerate Pocketing Tool-Path Generation." Applied Mechanics and Materials 311 (February 2013): 135–40. http://dx.doi.org/10.4028/www.scientific.net/amm.311.135.

Full text
Abstract
In this paper, we propose a new approach to accelerate pocketing tool-path generation by using graphics hardware (graphics processing units, GPU). The intersections among tool-path elements can be eliminated with higher efficiency from GPU-based Voronoi diagrams. According to our experimental results, the GPU-based computation was seven to eight times faster than CPU-based computation. In addition, the difference in tool-path geometry between the CPU-based and GPU-based methods was insignificant. Therefore, the GPU-based method can be used to accelerate the computation efficiently while the precision required for tool-path generation in pocket machining is assured.
47

Abramowicz, Kamil, and Przemysław Borczuk. "Comparative analysis of the performance of Unity and Unreal Engine game engines in 3D games." Journal of Computer Sciences Institute 30 (March 20, 2024): 53–60. http://dx.doi.org/10.35784/jcsi.5473.

Full text
Abstract
The article compared the performance of the Unity and Unreal Engine game engines based on tests conducted on two nearly identical games. The research focused on frames per second, CPU usage, RAM, and GPU memory. The results showed that Unity achieved a better average frame rate. Unreal Engine required more RAM and GPU resources. Analyzing CPU load values revealed that on the first system, Unity demanded less CPU usage. However, on the second system, Unreal Engine used over 10 percentage points less CPU. The conclusions from the research partially confirm the hypothesis that Unity requires fewer computer resources, although in some cases, Unreal Engine may demand fewer CPU resources.
48

Wasiljew, A., and K. Murawski. "A new CUDA-based GPU implementation of the two-dimensional Athena code." Bulletin of the Polish Academy of Sciences: Technical Sciences 61, no. 1 (2013): 239–50. http://dx.doi.org/10.2478/bpasts-2013-0023.

Full text
Abstract
We present a new version of the Athena code, which solves the magnetohydrodynamic equations in two-dimensional space. This new implementation, which we have named Athena-GPU, uses the CUDA architecture to allow code execution on a Graphics Processing Unit (GPU). The Athena-GPU code is an unofficial, modified version of the Athena code, which was originally designed for Central Processing Unit (CPU) architectures. We perform numerical tests based on the original Athena-CPU code and its GPU counterpart to make a performance analysis, which includes execution time, precision differences and accuracy. We narrowed our tests and analysis to double-precision floating-point operations and two-dimensional test cases only. Our comparison shows that the results are similar for both versions of the code, which confirms the correctness of our CUDA-based implementation. Our tests reveal that the Athena-GPU code can be 2 to 15 times faster than the Athena-CPU code, depending on the test case, the size of the problem and the hardware configuration.
49

Tramm, John, Paul Romano, Patrick Shriwise, et al. "Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs." EPJ Web of Conferences 302 (2024): 04010. http://dx.doi.org/10.1051/epjconf/202430204010.

Full text
Abstract
OpenMC is an open source Monte Carlo neutral particle transport application that has recently been ported to GPU using the OpenMP target offloading model. We examine the performance of OpenMC at scale on the Frontier, Polaris, and Aurora supercomputers, demonstrating that performance portability has been achieved by OpenMC across all three major GPU vendors (AMD, NVIDIA, and Intel). OpenMC’s GPU performance is compared to both the traditional CPU-based version of OpenMC as well as several other state-of-the-art CPU-based Monte Carlo particle transport applications. We also provide historical context by analyzing OpenMC’s performance on several legacy GPU and CPU architectures. This work includes some of the first published results for a scientific simulation application at scale on a supercomputer featuring Intel’s Max series “Ponte Vecchio” GPUs. It is also one of the first demonstrations of a large scientific production application using the OpenMP target offloading model to achieve high performance on all three major GPU platforms.
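The portability claim rests on the OpenMP target offloading model; the toy loop below (generic illustrative code, not OpenMC's source) shows the style of offloaded kernel involved. The same C++ compiles for AMD, NVIDIA, or Intel GPUs by changing only the compiler's offload target (e.g. an `-fopenmp-targets=...` style flag, depending on the toolchain).

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> flux(n, 0.0), source(n, 1.0);
    double* f = flux.data();
    const double* s = source.data();

    // Map the arrays to the device and distribute iterations over GPU teams/threads.
    #pragma omp target teams distribute parallel for map(tofrom: f[0:n]) map(to: s[0:n])
    for (int i = 0; i < n; ++i)
        f[i] += 0.5 * s[i];            // stand-in for a particle-history update

    std::printf("flux[0] = %f\n", f[0]);
    return 0;
}
```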
50

Preto, Bruno, Fernando Birra, Adriano Lopes, and Pedro Medeiros. "Object Identification in Binary Tomographic Images Using GPGPUs." International Journal of Creative Interfaces and Computer Graphics 4, no. 2 (2013): 40–56. http://dx.doi.org/10.4018/ijcicg.2013070103.

Full text
Abstract
The authors present a hybrid OpenCL CPU/GPU algorithm for the identification of connected structures inside black-and-white 3D scientific data. This algorithm exploits parallelism at both the CPU and GPGPU levels, but the work is predominantly done on GPUs. The underlying context of this work is the structural characterization of composite materials via tomography. The algorithm allows the authors to later infer the location and morphology of objects inside composite materials. Moreover, execution times are very low, allowing large data sets to be processed within acceptable running times. Intermediate solutions are computed independently over a partition of the spatial domain, following the data-parallelism paradigm, and then integrated at both the GPU and CPU levels using parallel multi-cores. The authors consistently exploit parallelism both at the CPU level, by allowing the CPU stage to run in multiple concurrent threads, and at the GPU level, with massive parallelism and concurrent data transfers and kernel executions.