
Journal articles on the topic 'Nvidia CUDA'



Consult the top 50 journal articles for your research on the topic 'Nvidia CUDA.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Nangla, Siddhante. "GPU Programming using NVIDIA CUDA." International Journal for Research in Applied Science and Engineering Technology 6, no. 6 (June 30, 2018): 79–84. http://dx.doi.org/10.22214/ijraset.2018.6016.

Full text
2

Pogorilyy, S. D., D. Yu Vitel, and O. A. Vereshchynsky. "Новітні архітектури відеоадаптерів. Технологія GPGPU. Частина 2." Реєстрація, зберігання і обробка даних 15, no. 1 (April 4, 2013): 71–81. http://dx.doi.org/10.35681/1560-9189.2013.15.1.103367.

Full text
Abstract:
The basic principles of working with shared and distributed memory in the NVidia CUDA technology are considered in detail. Thread-interaction patterns and the problem of global synchronization are described. A comparative analysis is carried out of the main technologies used in the GPGPU approach: Nvidia CUDA, OpenCL, and DirectCompute.
3

HURMAN, Ivan, Kira BOBROVNIKOVA, Leonid BEDRATYUK, and Hanna BEDRATYUK. "APPROACH FOR CODE ANALYSIS TO ESTIMATE POWER CONSUMPTION OF CUDA CORE." Herald of Khmelnytskyi National University. Technical sciences 217, no. 1 (February 23, 2023): 67–73. http://dx.doi.org/10.31891/2307-5732-2023-317-1-67-73.

Full text
Abstract:
The graphics processing unit is a popular computing device for achieving exascale performance in high-performance computing programs, which is used not only in graphics tasks, but also in computational tasks such as machine learning, scientific computing, and cryptography. With the help of a graphics processor, you can achieve significant speed and performance compared to the central processing unit. CUDA, Compute Unified Device Architecture, a graphics processing unit software development platform, allows developers to use the high-performance computing capabilities of graphics processing units to solve problems traditionally handled by central processing units. Even though the graphics processing unit has a relatively high power to performance ratio, it consumes a significant amount of power during computing. The paper proposes an approach for code analysis to estimate power consumption of CUDA core to improve the power efficiency of applications focused on computing on graphics processing units. The proposed approach makes it possible to estimate the power consumption of such applications without the need to run them on physical devices. The proposed approach is based on static analysis of the CUDA program and machine learning methods. To evaluate the effectiveness of the proposed approach, three graphics processing unit architectures were used: NVIDIA PASCAL, NVIDIA TURING, and NVIDIA AMPERE. The results of the experiments showed that for the NVIDIA AMPERE architecture, the proposed approach using decision trees makes it possible to achieve a determination coefficient of 0.9173. The results obtained confirm the effectiveness of the proposed code analysis method for estimating the power consumption of the CUDA core. This method can be useful for CUDA developers who want to improve the efficiency and power efficiency of their programs.
4

Ahmed, Rafid, Md Sazzadul Islam, and Jia Uddin. "Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 1 (February 1, 2018): 70. http://dx.doi.org/10.11591/ijece.v8i1.pp70-75.

Full text
Abstract:
As the majority of compression algorithms are implemented for CPU architectures, the primary focus of our work was to exploit the opportunities of GPU parallelism in audio compression. This paper presents an implementation of the Apple Lossless Audio Codec (ALAC) algorithm using NVIDIA's Compute Unified Device Architecture (CUDA) framework for GPUs. The core idea was to identify the areas where data parallelism could be applied and to use the CUDA parallel programming model to execute the identified parallel components on CUDA's Single Instruction Multiple Thread (SIMT) model. The dataset was retrieved from the European Broadcasting Union's Sound Quality Assessment Material (SQAM). Faster execution of the algorithm reduced execution time when applied to audio coding for large audio files. This paper also presents the reduction in power usage achieved by running the parallel components on the GPU. Experimental results reveal that we achieve about an 80-90% speedup through CUDA on the identified components over the CPU implementation while saving CPU power.
5

Kim, Youngtae, and Gyuhyeon Hwang. "Efficient Parallel CUDA Random Number Generator on NVIDIA GPUs." Journal of KIISE 42, no. 12 (December 15, 2015): 1467–73. http://dx.doi.org/10.5626/jok.2015.42.12.1467.

Full text
6

Semenenko, Julija, and Dmitrij Šešok. "Lygiagretūs skaičiavimai su CUDA." Jaunųjų mokslininkų darbai 47, no. 1 (July 3, 2017): 87–93. http://dx.doi.org/10.21277/jmd.v47i1.135.

Full text
Abstract:
The article presents the operating principles of the NVIDIA CUDA computing technology and the particulars of working with CUDA. Two numerical experiments, array addition and matrix multiplication, were carried out on GeForce and Quadro graphics cards as well as on the CPU, together with matrix-multiplication optimizations (shared memory, resolution of "bank conflicts", instruction-level parallelism). The execution-time results are analyzed for the int, float, and double data types and for different data sizes.
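The shared-memory tiling named among the optimizations above can be illustrated with a CPU sketch. This Python/NumPy fragment is our own illustration, not the authors' code: each loop iteration accumulates one staged tile of the product, the way a CUDA thread block stages sub-matrices in shared memory.

```python
import numpy as np

def tiled_matmul(a, b, tile=2):
    """CPU sketch of CUDA shared-memory tiling: the result is accumulated
    from tile-sized chunks, mirroring how a thread block loads sub-matrices
    into shared memory before the partial multiply."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m), dtype=np.float64)
    for t in range(0, k, tile):           # one iteration per staged tile
        c += a[:, t:t + tile] @ b[t:t + tile, :]
    return c

a = np.arange(16, dtype=np.float64).reshape(4, 4)
b = np.eye(4)
print(np.allclose(tiled_matmul(a, b), a @ b))  # True
```

On a real GPU the tile loop runs inside the kernel, with a barrier between loading a tile into shared memory and accumulating the partial product.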
7

Popov, S. E. "Improved phase unwrapping algorithm based on NVIDIA CUDA." Programming and Computer Software 43, no. 1 (January 2017): 24–36. http://dx.doi.org/10.1134/s0361768817010054.

Full text
8

Gonzalez Clua, Esteban Walter, and Marcelo Panaro Zamith. "Programming in CUDA for Kepler and Maxwell Architecture." Revista de Informática Teórica e Aplicada 22, no. 2 (November 21, 2015): 233. http://dx.doi.org/10.22456/2175-2745.56384.

Full text
Abstract:
Since the first version of CUDA was launched, many improvements have been made in GPU computing. Every new CUDA version has included important novel features, bringing the architecture ever closer to a typical parallel high-performance language. This tutorial presents the GPU architecture and CUDA principles, conceptualizing novel features introduced by NVIDIA such as dynamic parallelism, unified memory, and concurrent kernels. The text also includes some optimization remarks for CUDA programs.
9

Маханьков, Алексей Владимирович, Максим Олегович Кузнецов, and Анатолий Дмитриевич Панферов. "Efficiency of using NVIDIA coprocessors in modeling the behavior of charge carriers in graphene." Program Systems: Theory and Applications 12, no. 1 (March 23, 2021): 115–28. http://dx.doi.org/10.25209/2079-3316-2021-12-1-115-128.

Full text
Abstract:
Specialized hardware solutions play an important role in the development of supercomputing technologies. At present, most top-performance computing systems use mathematical coprocessors of various types. For this reason, application software intended to realize the potential of modern computing platforms must make effective use of hardware accelerators. In the course of work on a software system for modeling the behavior of charge carriers in graphene, it was necessary to add support for such accelerators and to study the efficiency of the resulting solution. Given the current situation and the outlook for the coming years, NVIDIA accelerators and the CUDA software technology were chosen. Because the hardware architecture of NVIDIA accelerators differs fundamentally from that of CPUs, and the mathematical libraries adapted for CUDA do not support the full range of algorithms used in the original version of the program, new solutions had to be found and their efficiency evaluated. The paper presents the specifics of implementing CUDA support and the results of comparative testing of the resulting solution on a problem with realistic characteristics.
10

Liu, Zhi Yuan, and Xue Zhang Zhao. "Research and Implementation of Image Rotation Based on CUDA." Advanced Materials Research 216 (March 2011): 708–12. http://dx.doi.org/10.4028/www.scientific.net/amr.216.708.

Full text
Abstract:
GPU technology frees the CPU from burdensome graphics computing tasks. NVIDIA, the main GPU producer, has added CUDA technology to its newer GPU models; it greatly enhances GPU functionality and offers a substantial advantage in complex matrix computation. General image-rotation algorithms and the structure of CUDA are introduced in this paper. An example of rotating an image with HALCON, based on CPU instruction extensions and on CUDA technology, demonstrates the advantage of CUDA by comparing the two results.
11

Lo, Win-Tsung, Yue-Shan Chang, Ruey-Kai Sheu, Chun-Chieh Chiu, and Shyan-Ming Yuan. "CUDT: A CUDA Based Decision Tree Algorithm." Scientific World Journal 2014 (2014): 1–12. http://dx.doi.org/10.1155/2014/745640.

Full text
Abstract:
The decision tree is one of the best-known classification methods in data mining, and many algorithms have been proposed to improve its performance. However, those algorithms were developed for, and run on, traditional distributed systems, whose latency cannot keep up with the huge volumes of data generated by ubiquitous sensing nodes without new technology. To reduce data-processing latency in large-scale data mining, we design and implement a new parallelized decision tree algorithm on CUDA (Compute Unified Device Architecture), a GPGPU solution provided by NVIDIA. In the proposed system, the CPU is responsible for flow control while the GPU is responsible for computation. We conducted many experiments to evaluate the performance of CUDT and compared it with a traditional CPU version. The results show that CUDT is 5 to 55 times faster than Weka-j48 and 18 times faster than SPRINT for large data sets.
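The division of labor described above (CPU flow control, GPU computation) rests on the fact that candidate splits of a node can be evaluated independently. A minimal CPU sketch of that data-parallel step, with hypothetical toy data and function names of our own (not the paper's code):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p * p)

def best_split(x, y):
    """Evaluate every candidate threshold independently; in a system like
    CUDT each loop iteration maps naturally onto one GPU thread, while the
    surrounding tree-building flow control stays on the CPU."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:           # independent -> parallelizable
        left, right = y[x <= t], y[x > t]
        w = left.size / y.size
        score = w * gini(left) + (1 - w) * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
t, s = best_split(x, y)
print(t, s)   # 2.0 0.0 (a perfect split between the two classes)
```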
12

Lin, Chun-Yuan, Chung-Hung Wang, Che-Lun Hung, and Yu-Shiang Lin. "Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs." International Journal of Genomics 2015 (2015): 1–9. http://dx.doi.org/10.1155/2015/950905.

Full text
Abstract:
Compound comparison is an important task in computational chemistry: from the comparison results, potential inhibitors can be found and then used in pharmaceutical experiments. The time complexity of a pairwise compound comparison is O(n²), where n is the maximal length of the compounds. In general, compounds are tens to hundreds of characters long, and the computation time is small. However, more and more compounds have been synthesized and extracted, now numbering in the tens of millions, so comparing against a large set of compounds (the multiple compound comparison problem, abbreviated MCC) remains time-consuming. The intrinsic time complexity of the MCC problem is O(k²n²) for k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, for single and multiple GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate computation across thread blocks on the GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In our experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual NVIDIA Tesla K20m GPU card, respectively.
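The abstract does not spell out the four LINGO-based strategies, so as a generic illustration of the balancing goal only: the sketch below costs each pairwise comparison as the product of the two compound lengths and greedily assigns it to the currently least-loaded worker. The function name and toy lengths are our own.

```python
from itertools import combinations

def balance_pairs(lengths, workers):
    """Greedy load balancing for an MCC-style workload: pairs are sorted by
    estimated cost (len(i) * len(j)) and each is given to the worker with
    the smallest accumulated load, standing in for a GPU thread block."""
    pairs = sorted(combinations(range(len(lengths)), 2),
                   key=lambda p: lengths[p[0]] * lengths[p[1]],
                   reverse=True)
    load = [0] * workers
    assign = [[] for _ in range(workers)]
    for i, j in pairs:
        w = load.index(min(load))          # least-loaded worker
        assign[w].append((i, j))
        load[w] += lengths[i] * lengths[j]
    return assign, load

assign, load = balance_pairs([10, 20, 30, 40], 2)
print(load)   # [1800, 1700] -- far closer than a naive round-robin split
```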
13

FUJIMOTO, NORIYUKI. "DENSE MATRIX-VECTOR MULTIPLICATION ON THE CUDA ARCHITECTURE." Parallel Processing Letters 18, no. 04 (December 2008): 511–30. http://dx.doi.org/10.1142/s0129626408003545.

Full text
Abstract:
Recently GPUs have acquired the ability to perform fast general purpose computation by running thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experiments are conducted on a PC with GeForce 8800GTX and 2.0 GHz Intel Xeon E5335 CPU. The results show that the proposed algorithm runs a maximum of 11.19 times faster than NVIDIA's BLAS library CUBLAS 1.1 on the GPU and 35.15 times faster than the Intel Math Kernel Library 9.1 on a single core x86 with SSE3 SIMD instructions. The performance of Jacobi's iterative method for solving linear equations, which includes the data transfer time between CPU and GPU, shows that the proposed algorithm is practical for real applications.
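A common CUDA mapping for dense matrix-vector multiplication assigns one thread per output row, each computing an independent dot product. The Python loop below is a CPU stand-in for that thread grid; it is an illustration only, not the paper's (more refined) algorithm.

```python
import numpy as np

def matvec_rowwise(a, x):
    """Row-per-thread sketch of dense matrix-vector multiply on CUDA:
    each loop iteration is independent and would run as one GPU thread."""
    y = np.empty(a.shape[0], dtype=np.float64)
    for row in range(a.shape[0]):          # each iteration = one GPU thread
        y[row] = np.dot(a[row], x)
    return y

a = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 1.0])
print(matvec_rowwise(a, x))   # [3. 7.]
```

In practice a fast GPU kernel also coalesces memory accesses and may use several threads per row with a reduction, which is where implementations differ.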
14

Cao, Kai, Qizhong Wu, Lingling Wang, Nan Wang, Huaqiong Cheng, Xiao Tang, Dongqing Li, and Lanning Wang. "GPU-HADVPPM V1.0: a high-efficiency parallel GPU design of the piecewise parabolic method (PPM) for horizontal advection in an air quality model (CAMx V6.10)." Geoscientific Model Development 16, no. 15 (August 1, 2023): 4367–83. http://dx.doi.org/10.5194/gmd-16-4367-2023.

Full text
Abstract:
Abstract. With semiconductor technology gradually approaching its physical and thermal limits, graphics processing units (GPUs) are becoming an attractive solution for many scientific applications due to their high performance. This paper presents an application of GPU accelerators in an air quality model. We demonstrate an approach that runs a piecewise parabolic method (PPM) solver of horizontal advection (HADVPPM) for the air quality model CAMx on GPU clusters. Specifically, we first convert the HADVPPM to a new Compute Unified Device Architecture C (CUDA C) code to make it computable on the GPU (GPU-HADVPPM). Then, a series of optimization measures are taken, including reducing the CPU–GPU communication frequency, increasing the data size computation on the GPU, optimizing the GPU memory access, and using thread and block indices to improve the overall computing performance of the CAMx model coupled with GPU-HADVPPM (named the CAMx-CUDA model). Finally, a heterogeneous, hybrid programming paradigm is presented and utilized with GPU-HADVPPM on the GPU clusters with a message passing interface (MPI) and CUDA. The offline experimental results show that running GPU-HADVPPM on one NVIDIA Tesla K40m and an NVIDIA Tesla V100 GPU can achieve up to a 845.4× and 1113.6× acceleration. By implementing a series of optimization schemes, the CAMx-CUDA model results in a 29.0× and 128.4× improvement in computational efficiency by using a GPU accelerator card on a K40m and V100 cluster, respectively. In terms of the single-module computational efficiency of GPU-HADVPPM, it can achieve 1.3× and 18.8× speedup on an NVIDIA Tesla K40m GPU and NVIDIA Tesla V100 GPU, respectively. The multi-GPU acceleration algorithm enables a 4.5× speedup with eight CPU cores and eight GPU accelerators on a V100 cluster.
15

Borcovas, Evaldas, and Gintautas Daunys. "CPU AND GPU (CUDA) TEMPLATE MATCHING COMPARISON / CPU IR GPU (CUDA) PALYGINIMAS VYKDANT ŠABLONŲ ATITIKTIES ALGORITMĄ." Mokslas – Lietuvos ateitis 6, no. 2 (April 24, 2014): 129–33. http://dx.doi.org/10.3846/mla.2014.16.

Full text
Abstract:
Image processing, computer vision, and other complicated optical-information-processing algorithms require large resources, and it is often desired to execute them in real time. It is hard to fulfill such requirements with a single CPU. NVIDIA's CUDA technology enables the programmer to use the GPU resources in the computer. The current research was carried out with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I), an NVIDIA GeForce GT320M CUDA-compatible graphics card (GPU I), an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II), and an NVIDIA GeForce GTX 560 CUDA-compatible graphics card (GPU II). The OpenCV 2.1 and CUDA-compatible OpenCV 2.4.0 libraries were used for the testing. The main tests used the standard MatchTemplate function from the OpenCV libraries. The algorithm uses a main image and a template, and the influence of these factors was tested: the main image and the template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the results, GPU computing on the hardware mentioned above is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and GPU does not differ significantly. The choice of template size influences the CPU computation. The difference in computing time between the GPUs can be explained by the number of cores they have; in our tests the faster GPU had 16 times more cores, and its computations ran 16 times faster.
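The MatchTemplate workload benchmarked above can be sketched with the squared-difference metric (one of several metrics OpenCV offers). Every offset's score is independent of all the others, which is exactly what the GPU exploits. This Python version is an illustration, not OpenCV's implementation:

```python
import numpy as np

def match_template_ssd(image, template):
    """Sum of squared differences at every offset; each (y, x) score is
    independent, so on a GPU one thread per offset evaluates them all
    in parallel."""
    ih, iw = image.shape
    th, tw = template.shape
    scores = np.empty((ih - th + 1, iw - tw + 1))
    for y in range(scores.shape[0]):        # each (y, x) = one GPU thread
        for x in range(scores.shape[1]):
            diff = image[y:y + th, x:x + tw] - template
            scores[y, x] = np.sum(diff * diff)
    return scores

image = np.zeros((5, 5))
image[2:4, 2:4] = 1.0                       # bright 2x2 patch to find
template = np.ones((2, 2))
scores = match_template_ssd(image, template)
y0, x0 = np.unravel_index(scores.argmin(), scores.shape)
print(int(y0), int(x0))   # 2 2 -- the patch's true location
```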
16

Borisov, A. N., and E. V. Myasnikov. "The implementation of ”Kuznyechik” encryption algorithm using NVIDIA CUDA technology." Information Technology and Nanotechnology, no. 2416 (2019): 308–13. http://dx.doi.org/10.18287/1613-0073-2019-2416-308-313.

Full text
Abstract:
In this paper, we discuss various options for implementing the ”Kuznyechik” block encryption algorithm using the NVIDIA CUDA technology. We use lookup tables as a basis for the implementation. In experiments, we study the influence of the size of the block of threads and the location of lookup tables on the encryption speed. We show that the best results are obtained when the lookup tables are stored in the global memory. The peak encryption speed reaches 30.83 Gbps on the NVIDIA GeForce GTX 1070 graphics processor.
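The lookup-table style of implementation can be sketched as follows. The table here is a random toy permutation, NOT the real Kuznyechik S-box, and serves only to show why per-byte table lookups parallelize so well and why table placement (global vs. shared memory) dominates performance.

```python
import numpy as np

# Toy substitution table: a random byte permutation standing in for a
# cipher's S-box. Each byte of the block is replaced independently, so
# on a GPU every byte can be handled by its own thread, and the lookup
# table's memory location becomes the performance bottleneck.
rng = np.random.default_rng(0)
SBOX = rng.permutation(256).astype(np.uint8)
INV_SBOX = np.argsort(SBOX).astype(np.uint8)   # inverse permutation

def substitute(block):
    return SBOX[block]          # one independent table lookup per byte

block = np.arange(16, dtype=np.uint8)
assert np.array_equal(INV_SBOX[substitute(block)], block)  # round trip
```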
17

Mu, Tian Hong, and Yun Yang. "A Method for Binary Image Component Parallel Labeling Algorithm Based on CUDA." Advanced Materials Research 811 (September 2013): 538–42. http://dx.doi.org/10.4028/www.scientific.net/amr.811.538.

Full text
Abstract:
In a GPU, more of the internal transistor budget is devoted to data processing rather than flow control. Compared with existing multi-core CPUs, the GPU has more processors and higher overall parallel-processing capability, which makes it suitable for large-scale supercomputing on a desktop platform. The CUDA platform, put forward by NVIDIA, is a new hardware and software architecture that realizes general-purpose computation on the GPU with high parallelism. We adopt the CUDA C programming language to implement a parallel binary-image connected-component labeling algorithm based on CUDA. The algorithm uses eight-connectivity labels, exhibits high parallelism with little dependence between steps, and leaves considerable room for further efficiency gains.
18

Shangareeva, G. R., and S. A. Mustafina. "Parallelization of the conjugate gradient method using the technology NVidia Cuda." Scientific Bulletin, no. 2 (2014): 155–62. http://dx.doi.org/10.17117/nv.2014.02.155.

Full text
19

Pala, Artur, and Marek Machaczek. "Computing of 3D Bifurcation Diagrams With Nvidia CUDA Technology." IEEE Access 8 (2020): 157773–80. http://dx.doi.org/10.1109/access.2020.3019633.

Full text
20

Afif, Mouna, Yahia Said, and Mohamed Atri. "Computer vision algorithms acceleration using graphic processors NVIDIA CUDA." Cluster Computing 23, no. 4 (March 17, 2020): 3335–47. http://dx.doi.org/10.1007/s10586-020-03090-6.

Full text
21

Zhu, Li, and Yi Min Yang. "Real-Time Multitasking Video Encoding Processing System of Multicore." Applied Mechanics and Materials 66-68 (July 2011): 2074–79. http://dx.doi.org/10.4028/www.scientific.net/amm.66-68.2074.

Full text
Abstract:
This paper presents optimizations based on the series of processors produced by NVIDIA, such as GeForce, Tegra, and Nexus, and discusses the future development of video image processors. It expounds the currently most popular DSP optimization techniques and objectives, and optimizes the designs of methods available in existing work. Based on NVIDIA's product line, it specifically discusses the CUDA GPU architecture and presents hardware and algorithms for today's most popular video-encoding equipment, applying practical techniques to improve the transmission and encoding of multimedia data.
22

Андрианов, А. Н., Т. П. Баранова, А. Б. Бугеря, and К. Н. Ефимкин. "Distribution of computations in hybrid computing systems when translating NORMA language programs." Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie), no. 3 (June 14, 2019): 224–36. http://dx.doi.org/10.26089/nummet.v20r321.

Full text
Abstract:
The methods of distributing the computational load when translating programs from the non-procedural (declarative) NORMA language into executable programs for various parallel architectures are discussed. Brief characteristics of the NORMA language and the main features of the NORMA compiler are given. The methods of automatically distributing the computational load when generating executable programs of the following types are described: OpenMP, NVIDIA CUDA, MPI+OpenMP, and MPI+OpenMP+NVIDIA CUDA. The problem of dynamic load balancing arising in the heterogeneous computing environment MPI+OpenMP+NVIDIA CUDA is considered, and a method for solving it is proposed. The results of applying the NORMA compiler to two different mathematical problems are given, and the performance of the resulting executable programs is estimated for various parallel architectures.
23

Sherbakov, S. S., and M. M. Polestchuk. "Acceleration of boundary element calculations for closed domain using nonlinear form functions and CUDA technology." Doklady BGUIR 19, no. 3 (June 2, 2021): 14–21. http://dx.doi.org/10.35596/1729-7648-2021-19-14-21.

Full text
Abstract:
The evolution of computer technologies, in both hardware and software, allows fast and accurate solutions to many applied problems in science. Acceleration of calculations is a broadly used technique, basically implemented through multithreading and multi-core processors. NVIDIA CUDA technology (or simply CUDA) opens the way to efficient acceleration of the boundary element method (BEM), which includes many independent stages. The main goal of the paper is the implementation and acceleration of the indirect boundary element method using three form functions. The calculation of the potential distribution inside a closed boundary under the action of a defined boundary condition is considered. To accelerate the corresponding calculations, they were parallelized on a graphics accelerator using NVIDIA CUDA technology. The dependence of the acceleration of parallel computations relative to sequential ones was explored for different numbers of boundary elements and computational nodes. A significant acceleration (up to 52 times) of the calculation of the potential distribution without loss of accuracy is shown. An acceleration of up to 22 times was achieved in the calculation of the mutual-influence matrix for the boundary elements. Using CUDA technology allows significant acceleration without loss of accuracy or convergence, so CUDA is a good way to parallelize the BEM. The developed approach allows problems in different areas of physics, such as acoustics, hydromechanics, electrodynamics, and solid mechanics, to be solved efficiently.
24

Kommera, Pranay Reddy, Vinay Ramakrishnaiah, Christine Sweeney, Jeffrey Donatelli, and Petrus H. Zwart. "GPU-accelerated multitiered iterative phasing algorithm for fluctuation X-ray scattering." Journal of Applied Crystallography 54, no. 4 (July 30, 2021): 1179–88. http://dx.doi.org/10.1107/s1600576721005744.

Full text
Abstract:
The multitiered iterative phasing (MTIP) algorithm is used to determine the biological structures of macromolecules from fluctuation scattering data. It is an iterative algorithm that reconstructs the electron density of the sample by matching the computed fluctuation X-ray scattering data to the external observations, and by simultaneously enforcing constraints in real and Fourier space. This paper presents the first ever MTIP algorithm acceleration efforts on contemporary graphics processing units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to accelerate the MTIP algorithm on NVIDIA GPUs. The computational performance of the CUDA-based MTIP algorithm implementation outperforms the CPU-based version by an order of magnitude. Furthermore, the Heterogeneous-Compute Interface for Portability (HIP) runtime APIs are used to demonstrate portability by accelerating the MTIP algorithm across NVIDIA and AMD GPUs.
25

Kolosov, A. D., V. O. Gorovoy, and V. V. Kondratiev. "A SERVICE FOR SOLVING BOOLEAN SATISFIABILITY PROBLEM USING NVIDIA CUDA TECHNOLOGY." Modern technologies. System analysis. Modeling 4, no. 56 (2017): 107–14. http://dx.doi.org/10.26731/1813-9108.2017.4(56).107-114.

Full text
26

Kusuma, Arjuna Wahyu, R. Damanhuri, Muhamad Nur Baihaqi, and Labib Habibie Sanjaya. "Image Encryption and Decryption Using Vigenere Cipher with Compute Unified Device Architecture (CUDA)." JURNAL MASYARAKAT INFORMATIKA 14, no. 1 (June 21, 2023): 29–37. http://dx.doi.org/10.14710/jmasif.14.1.51670.

Full text
Abstract:
Compute Unified Device Architecture (CUDA) is NVIDIA's application programming interface (API) and platform that allows direct access to the GPU's instruction set and provides support for interacting with the GPU for parallel computing. With CUDA, complex computations become faster and more efficient. The Vigenère cipher is a popular classical cryptographic scheme that implements a symmetric key of a given length. In this study, Vigenère-cipher encryption and decryption are applied to images on both the CPU and the GPU (CUDA). Parallelization with CUDA yields execution times considerably faster than on the CPU: the average time reduction is 99.46 percent for encryption and 99.47 percent for decryption.
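The per-byte independence that makes the Vigenère cipher map well onto GPU threads can be sketched in Python/NumPy. This is an illustration with a made-up key and stand-in image data, not the paper's code:

```python
import numpy as np

def vigenere(data, key, decrypt=False):
    """Element-wise Vigenère transform: every byte is shifted by the
    corresponding (repeated) key byte, independently of its neighbours,
    so image pixels map one-to-one onto GPU threads."""
    k = np.resize(key, data.shape)          # repeat the key across the data
    shift = -k.astype(np.int16) if decrypt else k.astype(np.int16)
    return ((data.astype(np.int16) + shift) % 256).astype(np.uint8)

img = np.arange(12, dtype=np.uint8)         # stand-in for image bytes
key = np.array([7, 13, 42], dtype=np.uint8)
enc = vigenere(img, key)
dec = vigenere(enc, key, decrypt=True)
assert np.array_equal(dec, img)             # round trip recovers the image
```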
27

Vintache, Damien, Bernard Humbert, and David Brasse. "Iterative reconstruction for transmission tomography on GPU using Nvidia CUDA." Tsinghua Science and Technology 15, no. 1 (February 2010): 11–16. http://dx.doi.org/10.1016/s1007-0214(10)70002-x.

Full text
28

Gao, Yuan, Yin Sun, Chun Hui Zhou, Xin Su, Xi Bin Xu, and Shi Dong Zhou. "Accelerating the 3GPP LTE System Level Simulation with NVidia CUDA." Applied Mechanics and Materials 58-60 (June 2011): 1596–601. http://dx.doi.org/10.4028/www.scientific.net/amm.58-60.1596.

Full text
Abstract:
With the rapid progress of the standardization of 3GPP LTE (Long Term Evolution) and LTE-Advanced, much research attention has been focused on link-level evaluations of 3GPP LTE systems, so as to demonstrate the rationality of novel transmission techniques. Unlike theoretical studies, incorporating novel transmission techniques into LTE communication systems may affect many parts of the system, such as the signaling process, reference-signal design, feedback-link design, and compatibility. Link-level studies may be too simple to evaluate the benefits of these novel techniques to the entire system. System-level simulation, on the other hand, concentrates on the performance of an entire network with tens of cells and hundreds to thousands of users; simulations designed from a system standpoint can illustrate the actual performance of an LTE system. Since the simulated system is quite large, simulation speed is very important for a system-level simulation platform. In this paper, we propose a design for a Matlab-based 3GPP LTE system-level simulator that makes use of parallel computing techniques supported by an NVIDIA GeForce GTX 260 graphics card. Our simulation experience shows that the simulation time is reduced by nearly one third after employing parallel computing techniques.
29

Blyth, Simon. "Meeting the challenge of JUNO simulation with Opticks: GPU optical photon acceleration via NVIDIA® OptiXTM." EPJ Web of Conferences 245 (2020): 11003. http://dx.doi.org/10.1051/epjconf/202024511003.

Full text
Abstract:
Opticks is an open source project that accelerates optical photon simulation by integrating NVIDIA GPU ray tracing, accessed via NVIDIA OptiX, with Geant4 toolkit based simulations. A single NVIDIA Turing architecture GPU has been measured to provide optical photon simulation speedup factors exceeding 1500 times single threaded Geant4 with a full JUNO analytic GPU geometry automatically translated from the Geant4 geometry. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented within CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures providing fast interpolated property lookup or wavelength generation. Major recent developments enable Opticks to benefit from ray trace dedicated RT cores available in NVIDIA RTX series GPUs. Results of extensive validation tests are presented.
APA, Harvard, Vancouver, ISO, and other styles
30

Bi, Yujiang, Yi Xiao, WeiYi Guo, Ming Gong, Peng Sun, Shun Xu, and Yi-bo Yang. "Lattice QCD GPU Inverters on ROCm Platform." EPJ Web of Conferences 245 (2020): 09008. http://dx.doi.org/10.1051/epjconf/202024509008.

Full text
Abstract:
The open-source ROCm/HIP platform for GPU computing provides a uniform framework supporting both NVIDIA and AMD GPUs, as well as the possibility of porting CUDA code to a HIP-compatible form. We present the porting progress on the Overlap fermion inverter (GWU-code) and on the general Lattice QCD inverter package, QUDA. A manual for using QUDA on HIP and tips for porting general CUDA code into the HIP framework are also provided.
APA, Harvard, Vancouver, ISO, and other styles
31

Syrocki, Łukasz, and Grzegorz Pestka. "Implementation of algebraic procedures on the GPU using CUDA architecture on the example of generalized eigenvalue problem." Open Computer Science 6, no. 1 (May 13, 2016): 79–90. http://dx.doi.org/10.1515/comp-2016-0006.

Full text
Abstract:
A ready-to-use set of functions is provided to facilitate solving the generalized eigenvalue problem for symmetric matrices, in order to efficiently calculate eigenvalues and eigenvectors using NVIDIA's Compute Unified Device Architecture (CUDA) technology. An integral part of CUDA is a high-level programming environment that enables tracking code executed both on the Central Processing Unit and on the Graphics Processing Unit. The presented matrix structures allow for an analysis of the advantages of using graphics processors in such calculations.
APA, Harvard, Vancouver, ISO, and other styles
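As a serial reference for what such GPU routines compute, the symmetric generalized eigenproblem A x = λ B x can be reduced to a standard one via Cholesky factorization; a minimal NumPy sketch (not the paper's CUDA code), assuming B is symmetric positive definite:

```python
import numpy as np

def generalized_eigh(A, B):
    """Solve A x = lam * B x for symmetric A and SPD B by
    Cholesky reduction to a standard symmetric eigenproblem."""
    L = np.linalg.cholesky(B)            # B = L @ L.T
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T                # standard problem C y = lam y
    lam, Y = np.linalg.eigh(C)
    X = Linv.T @ Y                       # back-transform: x = L^{-T} y
    return lam, X

A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[4.0, 0.0], [0.0, 1.0]])
lam, X = generalized_eigh(A, B)
for i in range(2):                       # residual check: A x = lam B x
    assert np.allclose(A @ X[:, i], lam[i] * (B @ X[:, i]))
```

For reference, NVIDIA's cuSOLVER library exposes dense generalized symmetric eigensolvers of this kind (the sygvd family) on the GPU.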
32

Lin, Chun-Yuan, Jin Ye, Che-Lun Hung, Chung-Hung Wang, Min Su, and Jianjun Tan. "Constructing a Bioinformatics Platform with Web and Mobile Services Based on NVIDIA Jetson TK1." International Journal of Grid and High Performance Computing 7, no. 4 (October 2015): 57–73. http://dx.doi.org/10.4018/ijghpc.2015100105.

Full text
Abstract:
Current high-end graphics processing units (GPUs), such as the NVIDIA Tesla, Fermi, and Kepler series cards containing up to thousands of cores per chip, are widely used in high-performance computing. These GPU cards (called desktop GPUs) must be installed in personal computers or servers with desktop CPUs; moreover, the cost and power consumption of constructing a high-performance computing platform with such desktop CPUs and GPUs are high. NVIDIA released the Tegra K1, called Jetson TK1, which contains 4 ARM Cortex-A15 CPUs and 192 CUDA cores (Kepler GPU) and is an embedded board with the advantages of low cost, low power consumption, and high applicability for embedded applications; it has become a new research direction. Hence, in this paper, a bioinformatics platform was constructed based on the NVIDIA Jetson TK1. The ClustalWtk and MCCtk tools, for sequence alignment and compound comparison respectively, were designed on this platform. Web and mobile services with user-friendly interfaces are also provided for these two tools. Experimental results showed that the cost-performance ratio of the NVIDIA Jetson TK1 is higher than that of an Intel XEON E5-2650 CPU and an NVIDIA Tesla K20m GPU card.
APA, Harvard, Vancouver, ISO, and other styles
33

Blyth, Simon. "Integration of JUNO simulation framework with Opticks: GPU accelerated optical propagation via NVIDIA® OptiX™." EPJ Web of Conferences 251 (2021): 03009. http://dx.doi.org/10.1051/epjconf/202125103009.

Full text
Abstract:
Opticks is an open source project that accelerates optical photon simulation by integrating NVIDIA GPU ray tracing, accessed via NVIDIA OptiX, with Geant4 toolkit based simulations. A single NVIDIA Turing architecture GPU has been measured to provide optical photon simulation speedup factors exceeding 1500 times single threaded Geant4 with a full JUNO analytic GPU geometry automatically translated from the Geant4 geometry. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented within CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures providing fast interpolated property lookup or wavelength generation. In this work we describe major recent developments to facilitate integration of Opticks with the JUNO simulation framework, including on-GPU collection-efficiency hit culling, which substantially reduces both the CPU memory needed for photon hits and copying overheads. Progress with the migration of Opticks to the all-new NVIDIA OptiX 7 API is also described.
APA, Harvard, Vancouver, ISO, and other styles
34

Wang, Wei Ling. "JPEG2000 Image Compression Method Based on GPGPU." Advanced Materials Research 756-759 (September 2013): 1314–19. http://dx.doi.org/10.4028/www.scientific.net/amr.756-759.1314.

Full text
Abstract:
In order to improve the compression speed of JPEG2000, the JPEG2000 compression standard is analyzed, and it is concluded that parts of the data in its core algorithm, the discrete wavelet transform (DWT), are independent of each other, making it very suitable for parallel processing. CUDA (Compute Unified Device Architecture) is a software and hardware platform released by NVIDIA that is well suited to large-scale data-parallel computing. Using CUDA technology on a general-purpose graphics processing unit (GPGPU), the DWT algorithm can be parallelized, and the program is optimized based on the characteristics of the GPGPU storage spaces. The experimental results show that the CUDA-parallelized DWT algorithm improves computing speed.
APA, Harvard, Vancouver, ISO, and other styles
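The data independence that makes the DWT suitable for GPUs is easy to see in a one-level Haar transform, where every row (indeed every coefficient pair) can be handled by its own thread; a serial NumPy sketch (Haar chosen for brevity — JPEG2000 itself uses the 5/3 and 9/7 wavelets):

```python
import numpy as np

def haar_rows(img):
    """One level of a row-wise Haar DWT. Every row is processed
    independently of every other row, which is exactly the data
    parallelism a GPU exploits (one thread per row or per pair)."""
    even, odd = img[:, 0::2], img[:, 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass half
    detail = (even - odd) / np.sqrt(2.0)   # high-pass half
    return np.hstack([approx, detail])

img = np.arange(16.0).reshape(4, 4)
out = haar_rows(img)
assert out.shape == img.shape
# the orthonormal Haar step preserves energy
assert np.isclose(np.sum(out ** 2), np.sum(img ** 2))
```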
35

Mamri, Ayoub, Mohamed Abouzahir, Mustapha Ramzi, and Rachid Latif. "ORB-SLAM accelerated on heterogeneous parallel architectures." E3S Web of Conferences 229 (2021): 01055. http://dx.doi.org/10.1051/e3sconf/202122901055.

Full text
Abstract:
A SLAM algorithm lets a robot map the desired environment while localizing itself in space. It is among the most efficient and widely adopted systems for autonomous vehicle navigation and robotics applications in ongoing research, yet it has not received any complete end-to-end hardware implementation. Our work aims at a hardware/software optimization of a computationally expensive functional block of monocular ORB-SLAM2. We then implement the proposed optimization on an FPGA-based heterogeneous embedded architecture, which shows attractive results. To this end, we adopt a comparative study with other heterogeneous architectures, including a powerful embedded GPGPU (NVIDIA Tegra TX1) and a high-end GPU (NVIDIA GeForce 920MX). The implementation is achieved using high-level-synthesis-based OpenCL for the FPGA and CUDA for the NVIDIA boards.
APA, Harvard, Vancouver, ISO, and other styles
36

DRANGA, Diana, and Radu-Daniel BOLCAȘ. "Artificial Intelligence Enhancements in the field of Functional Verification." Electrotehnica, Electronica, Automatica 69, no. 4 (November 15, 2021): 95–102. http://dx.doi.org/10.46904/eea.21.69.4.1108011.

Full text
Abstract:
Functional verification is one of the main processes in the research and development of new systems-on-chip. As chips become more and more complex, this step becomes an extensive bottleneck that can vastly delay mass production. It is a mandatory step, since the design must not contain any faults, to ensure proper functioning. If this step is bypassed, major financial losses and customer dissatisfaction can occur later in the process. Additionally, if the verification process is prolonged to achieve a higher-quality product, it also causes a financial impact. Therefore, the solution is to find ways to optimize this activity. This paper reviews how artificial intelligence can reduce this blockage, taking into consideration the time spent implementing the verification environment and the time needed to attain the targeted coverage percentage. The engineer decides which of the causes of time-consuming processes presented in the paper to address, depending on project specifics and his or her experience. A candidate for optimizing the training of the neural network is Nvidia's Compute Unified Device Architecture (CUDA), a parallel computing platform that makes use of the GPU, particularly the CUDA cores located inside Nvidia GPUs.
APA, Harvard, Vancouver, ISO, and other styles
37

Pérez, Juan Ignacio, Eliseo García, José A. de Frutos, and Felipe Cátedra. "Application of the Characteristic Basis Function Method Using CUDA." International Journal of Antennas and Propagation 2014 (2014): 1–13. http://dx.doi.org/10.1155/2014/721580.

Full text
Abstract:
The characteristic basis function method (CBFM) is a popular technique for efficiently solving the method of moments (MoM) matrix equations. In this work, we address the adaptation of this method to a relatively new computing infrastructure provided by NVIDIA, the Compute Unified Device Architecture (CUDA), and take into account some of the limitations which appear when the geometry under analysis becomes too big to fit into the Graphics Processing Unit’s (GPU’s) memory.
APA, Harvard, Vancouver, ISO, and other styles
38

Zhou, Yan, Tian Nan, Ya Li Cui, Tang Pei Cheng, and Jing Li Shao. "Numerical Simulation of Groundwater Flow Based on CUDA." Applied Mechanics and Materials 556-562 (May 2014): 3527–31. http://dx.doi.org/10.4028/www.scientific.net/amm.556-562.3527.

Full text
Abstract:
In this work, in order to improve the running speed of a groundwater-flow numerical model, we studied approaches to solving linear equations and the related acceleration problems on the CUDA platform. We developed a GPCG module on the GPU to replace the PCG module of MODFLOW 2005, using an NVIDIA TESLA C2070. We obtained effective acceleration on the GPU by building a series of idealized and real-case models, which showed that the overall speedup of the models is around 2.5 times and the computational speedup about 10 times.
APA, Harvard, Vancouver, ISO, and other styles
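The PCG iteration that the GPCG module moves to the GPU can be sketched in serial Python; a Jacobi (diagonal) preconditioner is assumed here for simplicity, whereas MODFLOW's PCG package offers other preconditioner choices:

```python
import numpy as np

def pcg(A, b, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned conjugate gradient. The dominant costs --
    the matrix-vector product and the vector updates -- are the parts
    a GPU version parallelizes."""
    Minv = 1.0 / np.diag(A)              # Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# small SPD test system: 1D Laplacian
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = pcg(A, b)
assert np.allclose(A @ x, b, atol=1e-8)
```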
39

Никитин, В. В., А. А. Дучков, and Ф. Андерссон. "Acceleration of seismic data processing with wave-packet decomposition using NVidia CUDA." Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie), no. 3 (August 31, 2017): 293–311. http://dx.doi.org/10.26089/nummet.v18r326.

Full text
Abstract:
Seismic data are characterized by multidimensionality, large data sizes, and irregular structures. In this paper we consider an optimal decomposition of seismic data using the basis of Gaussian wave packets. We implemented and optimized a number of fast algorithms for forward and inverse transforms for three-dimensional seismic data decomposition. The algorithms implemented on the GPU demonstrate a 2-6x speedup compared to a 20-core CPU. The programs were tested on synthetic seismic data sets: data reconstruction from Gaussian wave-packet coefficients, data compression, denoising, and interpolation in the case of missing traces.
APA, Harvard, Vancouver, ISO, and other styles
40

ШВАЧИЧ, Геннадій, Павло ЩЕРБИНА, and Дмитро МОРОЗ. "АГРЕГАЦІЯ ОБЧИСЛЮВАЛЬНИХ КАНАЛІВ НА ОСНОВІ ПЛАТФОРМИ NVIDIA CUDA ДЛЯ РЕЖИМІВ УПРАВЛІННЯ КОМПОНЕНТАМИ ТЕХНОЛОГІЧНИХ СИСТЕМ." Information Technology: Computer Science, Software Engineering and Cyber Security, no. 2 (January 10, 2023): 85–92. http://dx.doi.org/10.32782/it/2022-2-10.

Full text
Abstract:
Today, practice poses problems whose solution by known standard approaches is often a significant challenge, one that can be resolved only by applying multiprocessor computer technologies. One of the principal features of applying these technologies is the increase in computational performance and speed. High computational performance permits the solution of multidimensional problems, as well as problems requiring a significant amount of processor time, while speed makes it possible not only to control technological processes effectively but also creates the prerequisites for developing advanced and novel technological processes. The application of high-performance computing is therefore an urgent, first-priority problem. The aim of this work is to improve the structure and increase the performance of a multiprocessor computing system by aggregating computational channels based on the NVIDIA CUDA platform for modes of controlling components of technological processes. The proposed approach made it possible not only to increase the efficiency of parallelization but also to substantially reduce computation time. In the presented design of the multiprocessor system, two NVIDIA GeForce GTX 1080 graphics cards were "linked". This approach aims not only at a substantial increase in computing performance but also at a significant reduction in latency and substantial offloading of the system bus. Compared with the known approach, the use of NVIDIA's hardware/software architecture for parallel computing based on the CUDA platform made it possible to increase the video memory of each computing node of the multiprocessor system by 16 GB and to raise the overall performance of a system node by 350 GFLOPS. The practical value of the research lies in solving the problem of intensifying the spheroidizing annealing of long steel products. The metal heat-treatment process itself gains such advantages as high productivity and a substantial reduction in energy consumption, and allows control of technological parameters along the length and cross-section of the metal.
APA, Harvard, Vancouver, ISO, and other styles
41

Krasnov, Mikhail Mikhailovich, and Olga Borisovna Feodoritova. "The use of functional programming library to parallelize on graphics accelerators with CUDA technology." Keldysh Institute Preprints, no. 51 (2022): 1–36. http://dx.doi.org/10.20948/prepr-2022-51.

Full text
Abstract:
Modern graphics accelerators (GPUs) can significantly speed up the execution of numerical tasks. However, porting programs to graphics accelerators is not an easy task, sometimes requiring an almost complete rewrite. Thanks to technology developed by NVIDIA, CUDA graphics accelerators allow a single source code for both conventional processors (CPUs) and CUDA devices. Within this single source code, however, the compiler must somehow be told which parts to parallelize on shared memory. The functional programming library developed by the authors hides the choice of shared-memory parallelization mechanism inside the library and makes the user's source code completely independent of the computing device used (CPU or CUDA). This article shows how this can be done.
APA, Harvard, Vancouver, ISO, and other styles
42

Afif, Mouna, Yahia Said, and Mohamed Atri. "Efficient 2D Convolution Filters Implementations on Graphics Processing Unit Using NVIDIA CUDA." International Journal of Image, Graphics and Signal Processing 10, no. 8 (August 8, 2018): 1–8. http://dx.doi.org/10.5815/ijigsp.2018.08.01.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Ферцев, Александр Александрович, and A. A. Fertsev. "Ускорение обучения нейронной сети для распознавания изображений с помощью технологии NVIDIA CUDA." Вестник Самарского государственного технического университета. Серия «Физико-математические науки» 1(26) (2012): 183–91. http://dx.doi.org/10.14498/vsgtu990.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Rao, Naseem, and Safdar Tanweer. "Performance Analysis of Healthcare data and its Implementation on NVIDIA GPU using CUDA-C." Journal of Drug Delivery and Therapeutics 9, no. 1-s (February 21, 2019): 361–63. http://dx.doi.org/10.22270/jddt.v9i1-s.2447.

Full text
Abstract:
In this paper we show how commodity-GPU-based data mining can help classify various healthcare data into different groups faster than traditional CPU-based systems. In addition, such systems are cheaper than ASIC (Application-Specific Integrated Circuit) based solutions. Faster clustering of data could provide useful insights for making successful decisions in cases of emergencies and outbreaks. Speech disfluency and stuttering assessment can also be addressed through classification of audio/speech samples using ANN, k-NN, SVM, etc. A faster and more economical way to obtain such insights is of paramount importance. Specifically, as a proof of concept, we used an NVIDIA GPU to implement the k-means algorithm on a healthcare-related data set, and we present conclusions based on our research so far. Keywords: NVIDIA; GPU; ECG; CPU; ANN.
APA, Harvard, Vancouver, ISO, and other styles
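The k-means proof of concept described above can be sketched serially. This is plain Lloyd's k-means in Python, with a deterministic initialization chosen here for reproducibility, not the paper's CUDA-C implementation:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Lloyd's k-means. The assignment step computes distances from
    every point to every center independently -- the part a CUDA
    version maps to one thread per data point."""
    # deterministic, spread-out initialization (illustrative choice)
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assignment step: nearest center for each point
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# two well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centers = kmeans(X, 2)
assert (labels[:20] == labels[0]).all() and labels[0] != labels[20]
```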
45

Choi, Hyeonseong, and Jaehwan Lee. "Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training." Applied Sciences 11, no. 21 (November 4, 2021): 10377. http://dx.doi.org/10.3390/app112110377.

Full text
Abstract:
To achieve high accuracy when performing deep learning, it is necessary to use a large-scale training model. However, due to the limitations of GPU memory, it is difficult to train large-scale training models within a single GPU. NVIDIA introduced a technology called CUDA Unified Memory with CUDA 6 to overcome the limitations of GPU memory by virtually combining GPU memory and CPU memory. In addition, in CUDA 8, memory advise options are introduced to efficiently utilize CUDA Unified Memory. In this work, we propose a newly optimized scheme based on CUDA Unified Memory to efficiently use GPU memory by applying different memory advise to each data type according to access patterns in deep learning training. We apply CUDA Unified Memory technology to PyTorch to see the performance of large-scale learning models through the expanded GPU memory. We conduct comprehensive experiments on how to efficiently utilize Unified Memory by applying memory advises when performing deep learning. As a result, when the data used for deep learning are divided into three types and a memory advise is applied to the data according to the access pattern, the deep learning execution time is reduced by 9.4% compared to the default Unified Memory.
APA, Harvard, Vancouver, ISO, and other styles
46

ROBERGE, VINCENT, and MOHAMMED TARBOUCHI. "COMPARISON OF PARALLEL PARTICLE SWARM OPTIMIZERS FOR GRAPHICAL PROCESSING UNITS AND MULTICORE PROCESSORS." International Journal of Computational Intelligence and Applications 12, no. 01 (March 2013): 1350006. http://dx.doi.org/10.1142/s1469026813500065.

Full text
Abstract:
In this paper, we present a parallel implementation of the particle swarm optimization (PSO) on graphical processing units (GPU) using CUDA. By fully utilizing the processing power of graphic processors, our implementation (CUDA-PSO) provides a speedup of 167× compared to a sequential implementation on CPU. This speedup is significantly superior to what has been reported in recent papers and is achieved by four optimizations we made to better adapt the parallel algorithm to the specific architecture of the NVIDIA GPU. However, because today's personal computers are usually equipped with a multicore CPU, it may be unfair to compare our CUDA implementation to a sequential one. For this reason, we implemented a parallel PSO for multicore CPUs using MPI (MPI-PSO) and compared its performance against our CUDA-PSO. The execution time of our CUDA-PSO remains 15.8× faster than our MPI-PSO which ran on a high-end 12-core workstation. Moreover, we show with statistical significance that the results obtained using our CUDA-PSO are of equal quality as the results obtained by the sequential PSO or the MPI-PSO. Finally, we use our parallel PSO for real-time harmonic minimization of multilevel power inverters with 20 DC sources while considering the first 100 harmonics and show that our CUDA-PSO is 294× faster than the sequential PSO and 32.5× faster than our parallel MPI-PSO.
APA, Harvard, Vancouver, ISO, and other styles
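The per-particle updates that make PSO GPU-friendly can be sketched serially; the inertia and acceleration coefficients below are common defaults, not necessarily those used in the paper:

```python
import numpy as np

def pso(f, dim=2, n_particles=64, iters=200, seed=1):
    """Global-best PSO. Each particle's velocity/position update is
    independent of the others within an iteration, so a GPU version
    assigns one thread per particle; this is a serial sketch only."""
    rng = np.random.default_rng(seed)
    w, c1, c2 = 0.72, 1.49, 1.49          # common coefficient choices
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()  # global best
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

sphere = lambda p: float(np.sum(p ** 2))
best, best_val = pso(sphere)
assert best_val < 1e-2    # converges near the optimum at the origin
```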
47

Koprawi, Muhammad. "Parallel Computation in Uncompressed Digital Images Using Computer Unified Device Architecture and Open Computing Language." PIKSEL : Penelitian Ilmu Komputer Sistem Embedded and Logic 8, no. 1 (March 20, 2020): 31–38. http://dx.doi.org/10.33558/piksel.v8i1.2017.

Full text
Abstract:
In general, a computer program executes instructions serially. These instructions run on the CPU, which is referred to as serial computing. When computations are performed at large scale, however, the time required by serial computing becomes very long. Therefore, another form of computation is needed that can reduce data-processing time, such as parallel computing. Parallel computing can be done on GPUs (Graphics Processing Units) with the help of toolkits such as CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language). CUDA can only be run on NVIDIA graphics cards, while OpenCL can be run on all types of graphics cards. This research compares the parallel computing performance of CUDA and OpenCL, tested on uncompressed digital images of several different sizes. The results of the study are expected to serve as a reference for digital image processing methods.
APA, Harvard, Vancouver, ISO, and other styles
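Both CUDA and OpenCL express image operations as a per-pixel kernel launched over a grid of threads, each thread computing its pixel from a global index. A serial Python emulation of that mapping (the brightness kernel and launch geometry here are illustrative, not taken from the paper):

```python
import numpy as np

def brightness_kernel(img_flat, out_flat, gain, idx):
    """Body of a hypothetical per-pixel kernel: each 'thread' touches
    exactly one pixel, so there are no data dependencies."""
    if idx < img_flat.size:                  # bounds guard, as in real kernels
        out_flat[idx] = min(int(img_flat[idx] * gain), 255)

img = np.arange(64, dtype=np.uint8).reshape(8, 8)
out = np.empty(64, dtype=np.int64)
block_dim, grid_dim = 32, 2                  # 2 blocks of 32 threads
for block in range(grid_dim):                # a GPU runs these in parallel
    for thread in range(block_dim):
        idx = block * block_dim + thread     # global thread index
        brightness_kernel(img.ravel(), out, 2.0, idx)
assert out[10] == 20 and out.max() <= 255
```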
48

Li, Xiao Ping, Hong Ming Zhang, and Xiao Xu Zhang. "Lattice Boltzman Simulations of Cavity Flow Using CUDA." Applied Mechanics and Materials 444-445 (October 2013): 316–19. http://dx.doi.org/10.4028/www.scientific.net/amm.444-445.316.

Full text
Abstract:
GPGPU has drawn much attention for accelerating non-graphics applications. A new algorithm for the numerical simulation of the lattice Boltzmann method (LBM) based on CUDA is studied. Cavity flow is simulated with the D2Q9 LBM model, using the non-equilibrium extrapolation method for the velocity boundary to handle the wall boundary conditions, and using global memory and texture memory to store data. In the model, the 9 distribution functions are all stored as two-dimensional grids; each grid point is assigned a thread, and each thread block contains 256 threads. The cavity-flow simulation with LBM was carried out using CUDA on an NVIDIA GeForce 8600 GT in a PC. The speed is more than 15 times faster than that of the CPU.
APA, Harvard, Vancouver, ISO, and other styles
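The collision step of a D2Q9 LBM code relaxes each node toward an equilibrium distribution that depends only on that node's own density and velocity, which is why one thread per node works so well. A NumPy sketch of the standard D2Q9 equilibrium (not the paper's code):

```python
import numpy as np

# D2Q9 lattice: discrete velocities and their weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, u):
    """D2Q9 equilibrium distribution f_i^eq for one lattice node; in
    a GPU code every node evaluates this independently (one thread
    per node)."""
    cu = c @ u                       # c_i . u for all 9 directions
    usq = u @ u
    return rho * w * (1 + 3 * cu + 4.5 * cu ** 2 - 1.5 * usq)

feq = equilibrium(1.0, np.array([0.05, 0.0]))
assert np.isclose(feq.sum(), 1.0)                 # mass conserved
assert np.allclose(c.T @ feq, [0.05, 0.0])        # momentum conserved
```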
49

Jin, Nai Gao, Fei Mo Li, and Zhao Xing Li. "Quasi-Monte Carlo Gaussian Particle Filtering Acceleration Using CUDA." Applied Mechanics and Materials 130-134 (October 2011): 3311–15. http://dx.doi.org/10.4028/www.scientific.net/amm.130-134.3311.

Full text
Abstract:
A CUDA-accelerated quasi-Monte Carlo Gaussian particle filter (QMC-GPF) is proposed to deal with real-time non-linear non-Gaussian problems. The GPF is especially suitable for parallel implementation as a result of the elimination of the resampling step. QMC-GPF is an efficient counterpart of the GPF that uses QMC sampling instead of MC. Since particles generated by the QMC method provide the best-possible distribution in the sampling space, QMC-GPF can make more accurate estimates with the same number of particles compared with a traditional particle filter. Experimental results show that our GPU implementation of QMC-GPF can achieve a maximum speedup ratio of 95 on an NVIDIA GeForce GTX 460.
APA, Harvard, Vancouver, ISO, and other styles
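Low-discrepancy point sets of the kind QMC sampling draws from can be generated with a Halton sequence; a minimal sketch (the abstract does not specify the generator used, so this is purely illustrative):

```python
def halton(i, base):
    """i-th element of the van der Corput sequence in the given base.
    Pairing bases 2 and 3 gives a 2-D Halton low-discrepancy sequence,
    one simple QMC point set."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

points = [(halton(i, 2), halton(i, 3)) for i in range(1, 257)]
# QMC points cover the unit square evenly: each quadrant gets ~1/4,
# with far less fluctuation than pseudo-random sampling
q = sum(1 for x, y in points if x < 0.5 and y < 0.5)
assert 48 <= q <= 80     # close to 64 of 256
```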
50

Wang, Song, Shan Liang Yang, and Ge Li. "Study of Accelerating Infrared Imaging Simulation Based on CUDA." Applied Mechanics and Materials 651-653 (September 2014): 2045–49. http://dx.doi.org/10.4028/www.scientific.net/amm.651-653.2045.

Full text
Abstract:
This paper builds an infrared scene of a sphere target based on JMAES, which provides an EO/IR environment and is suited to building engineering- and engagement-level infrared imaging simulation systems. In addition, to speed up this infrared imaging simulation, we analyzed the external rendering mode applied in the JMAES EO/IR environment and found that external rendering image compositing is a highly independent process well suited to parallel computing. After testing on an NVIDIA TESLA C2075 GPU with CUDA and comparing the performance with the corresponding sequential process on the CPU, we obtained a satisfactory result: the process achieves a speedup of over 10.
APA, Harvard, Vancouver, ISO, and other styles