
Dissertations / Theses on the topic 'Nvidia CUDA'


Consult the top 50 dissertations / theses for your research on the topic 'Nvidia CUDA.'


1

Zajíc, Jiří. "Překladač jazyka C# do jazyka Nvidia CUDA." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236439.

Full text
Abstract:
This master's thesis focuses on GPU-accelerated calculations on NVIDIA graphics cards. CUDA technology is used and made accessible from the .NET platform. The problem is solved as a compiler from the C# programming language to the NVIDIA CUDA language, driven by expression attributes of C#, that preserves the same semantics of the actions. The application is implemented in C# and uses NRefactory, an open-source library.
APA, Harvard, Vancouver, ISO, and other styles
2

Savioli, Nicolo'. "Parallelization of the algorithm WHAM with NVIDIA CUDA." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/6377/.

Full text
Abstract:
The aim of my thesis is to parallelize the Weighted Histogram Analysis Method (WHAM), which is a popular algorithm used to calculate the free energy of a molecular system in Molecular Dynamics simulations. WHAM works in post-processing in cooperation with another algorithm called Umbrella Sampling. Umbrella Sampling adds a bias to the potential energy of the system in order to force it to sample a specific region of configurational space. N independent simulations are performed in order to sample all the regions of interest. Subsequently, the WHAM algorithm is used to estimate the original system energy starting from the N atomic trajectories. The parallelization of WHAM has been performed with CUDA, a platform that allows programs to run on the GPUs of NVIDIA graphics cards, which have a parallel architecture. The parallel implementation can significantly speed up WHAM execution compared to previous serial CPU implementations; the WHAM CPU code, by contrast, shows severe timing problems for very large numbers of interactions. The algorithm has been written in C++ and executed on UNIX systems equipped with NVIDIA graphics cards. The results were satisfactory, showing a performance increase when the model was executed on graphics cards of higher compute capability. Nonetheless, the GPUs used to test the algorithm are quite old and not designed for scientific computing, and it is likely that a further performance increase would be obtained if the algorithm were executed on GPU clusters with a high level of computational efficiency. The thesis is organized as follows: I first describe the mathematical formulation of Umbrella Sampling and the WHAM algorithm with their applications to the study of ionic channels and to Molecular Docking (Chapter 1); then I present the CUDA architectures used to implement the model (Chapter 2); finally, the results obtained on model systems are presented (Chapter 3).
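As a rough illustration of the kind of data-parallel update involved, the sketch below distributes the core WHAM probability estimate over histogram bins, one CUDA thread per bin. The kernel, the array names and the flat data layout are assumptions made for illustration, not code from the thesis.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per histogram bin: unnormalized WHAM estimate
//   P(bin) = sum_j n_j(bin) / sum_j N_j * exp((f_j - U_j(bin)) / kT)
// hist[j*numBins + b]: counts from window j, samples[j]: total samples N_j,
// f[j]: current free-energy offsets, bias[j*numBins + b]: biasing potential U_j(bin).
__global__ void whamBinUpdate(const float* hist, const float* bias,
                              const float* f, const float* samples,
                              float* prob, int numWindows, int numBins, float kT)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= numBins) return;

    float num = 0.0f, den = 0.0f;
    for (int j = 0; j < numWindows; ++j) {
        num += hist[j * numBins + b];
        den += samples[j] * expf((f[j] - bias[j * numBins + b]) / kT);
    }
    prob[b] = (den > 0.0f) ? num / den : 0.0f;
}
```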
APA, Harvard, Vancouver, ISO, and other styles
3

Ikeda, Patricia Akemi. "Um estudo do uso eficiente de programas em placas gráficas." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-25042012-212956/.

Full text
Abstract:
Initially designed for graphics processing, graphics cards (GPUs) have evolved into high-performance, general-purpose parallel coprocessors. Because of the enormous potential they offer to many research and commercial areas, NVIDIA pioneered the CUDA architecture (compatible with many of its cards), an environment that exploits this computational power while being easier to program. To take full advantage of the GPU, some practices must be followed; one of them is to keep the hardware as busy as possible. This work proposes a practical and extensible tool that helps the programmer choose the configuration that best achieves this goal.
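One way such a tool can suggest a launch configuration is the occupancy API that CUDA itself provides. The sketch below is a minimal illustration, assuming CUDA 6.5 or later; the `saxpy` kernel and the problem size are made up and not taken from the thesis.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    int n = 1 << 20;
    int minGridSize = 0, blockSize = 0;

    // Ask the runtime for the block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;

    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);
    return 0;
}
```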
APA, Harvard, Vancouver, ISO, and other styles
4

Rivera-Polanco, Diego Alejandro. "COLLECTIVE COMMUNICATION AND BARRIER SYNCHRONIZATION ON NVIDIA CUDA GPU." Lexington, Ky. : [University of Kentucky Libraries], 2009. http://hdl.handle.net/10225/1158.

Full text
Abstract:
Thesis (M.S.)--University of Kentucky, 2009.
Title from document title page (viewed on May 18, 2010). Document formatted into pages; contains: ix, 88 p. : ill. Includes abstract and vita. Includes bibliographical references (p. 86-87).
APA, Harvard, Vancouver, ISO, and other styles
5

Harvey, Jesse Patrick. "GPU acceleration of object classification algorithms using NVIDIA CUDA /." Online version of thesis, 2009. http://hdl.handle.net/1850/10894.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Lerchundi, Osa Gorka. "Fast Implementation of Two Hash Algorithms on nVidia CUDA GPU." Thesis, Norwegian University of Science and Technology, Department of Telematics, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9817.

Full text
Abstract:

User needs increase as time passes. We started with computers the size of a room, where punched cards played the role that machine code objects play today, and at present we are at a point where the number of processors within our graphics device is not enough for our requirements. A change in the evolution of computing is looming: we are in a transition in which sequential computation is losing ground to distributed computation. This trend is not new because of the arrival of easily accessible GPUs; long before, it was used in projects like SETI@Home, fightAIDS@Home and ClimatePrediction, which were already shouting from the rooftops about what was to come. Grid computing was its formal name. Until now it was linked only to systems distributed over the network, but as this technology evolves it will take on a different meaning. nVidia, with CUDA, has been one of the first companies to make this kind of software package noteworthy: instead of being a proof of concept it is a real tool, where the transition is expressed in its greatest magnitude and the true artist is the programmer who uses it and achieves performance increases. As with many innovations, a worldwide distributed community has grown behind this software package, each member doing their bit. It is noteworthy that after the CUDA release a lot of software developments appeared, such as the cracking of the hitherto insurmountable WPA. The same could be said of the Sony-Toshiba-IBM (STI) alliance: it has a great community and great software (IBM is the company in charge of maintenance). Unlike nVidia it is not as accessible, but IBM is powerful enough to enter the home-made supercomputing market. In this case, after IBM released the PS3 SDK, a notable application named Folding@Home was created using the benefits of parallel computing; its purpose is, inter alia, to find a cure for cancer. To sum up, this is only the beginning, and in this thesis the possibility of using this technology for accelerating cryptographic hash algorithms is sized up. BLUE MIDNIGHT WISH (the hash algorithm that undergoes the surgery) is subjected to an environment change, adapting it to parallel-capable code in order to obtain empirical measurements that can be compared to the current sequential implementations. It answers questions that have not been answered until now. BLUE MIDNIGHT WISH is a candidate hash function for the next NIST standard SHA-3, designed by professor Danilo Gligoroski from NTNU and Vlastimil Klima, an independent cryptographer from the Czech Republic. So far, from the speed point of view, BLUE MIDNIGHT WISH is at the top of the charts (generally in second place, right behind EDON-R, another hash function from professor Danilo Gligoroski). One part of the work in this thesis was to investigate whether it is possible to achieve faster processing of Blue Midnight Wish when the computations are distributed among the cores of a CUDA device card. My numerous experiments give a clear answer: NO. Although the answer is negative, it still has significant scientific value. The point is that my work acknowledges the viewpoints and standing of a part of the cryptographic community that is doubtful that cryptographic primitives will benefit from being executed in parallel on many cores of one processor. Indeed, my experiments show that the communication costs between cores in CUDA outweigh by a big margin the computational costs done inside one core (processor) unit.

APA, Harvard, Vancouver, ISO, and other styles
7

Virk, Bikram. "Implementing method of moments on a GPGPU using Nvidia CUDA." Thesis, Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/33980.

Full text
Abstract:
This thesis concentrates on the algorithmic aspects of the Method of Moments (MoM) and Locally Corrected Nyström (LCN) numerical methods in electromagnetics. The data dependency in each step of the algorithm is analyzed to implement a parallel version that can harness the powerful processing capability of a General Purpose Graphics Processing Unit (GPGPU). The GPGPU programming model provided by NVIDIA's Compute Unified Device Architecture (CUDA) is described, introducing the software tools that enable implementing C code on the GPGPU. Various optimizations such as the partial update at every iteration, inter-block synchronization and the use of shared memory enable us to achieve an overall speedup of approximately 10. The study also brings out the strengths and weaknesses of implementing different methods such as Crout's LU decomposition and triangular matrix inversion on a GPGPU architecture. The results suggest future directions of study in different algorithms and their effectiveness in a parallel processing environment. The performance data collected show how different features of the GPGPU architecture can be enhanced to yield higher speedup.
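Inter-block synchronization of this kind is nowadays expressible with cooperative groups; the fragment below is a generic sketch under that assumption (a device and CUDA version supporting cooperative launch), not the thesis implementation, which predates this API. The iterative update it performs is hypothetical.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each iteration updates a partial result, then all blocks synchronize
// before the next iteration reads it (grid-wide barrier).
__global__ void iterativeSolve(float* x, const float* b, int n, int iters)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int it = 0; it < iters; ++it) {
        if (i < n) {
            // hypothetical partial update of unknown i
            x[i] = 0.5f * (x[i] + b[i]);
        }
        grid.sync();   // wait for every block before the next sweep
    }
}
// Note: must be launched with cudaLaunchCooperativeKernel and a grid that fits
// on the device simultaneously; a plain <<<...>>> launch will not synchronize.
```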
APA, Harvard, Vancouver, ISO, and other styles
8

Sreenibha, Reddy Byreddy. "Performance Metrics Analysis of GamingAnywhere with GPU accelerated Nvidia CUDA." Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16846.

Full text
Abstract:
The modern world has opened the gates to many advancements in cloud computing, particularly in the field of cloud gaming. The most recent development in this area is the open-source cloud gaming system called GamingAnywhere. The relationship between the CPU and the GPU is the main focus of this thesis. Graphics Processing Unit (GPU) performance plays a vital role in the playing experience and in the enhancement of GamingAnywhere. This paper concentrates on the virtualization of the GPU and suggests that accelerating this unit with NVIDIA CUDA is the key to better performance when using GamingAnywhere. After extensive research, gVirtuS was chosen as the technique for virtualizing NVIDIA CUDA. An experimental study was conducted to evaluate the feasibility and performance of GPU solutions by VMware in the cloud gaming scenarios given by GamingAnywhere. Performance is measured in terms of bitrate, packet loss, jitter and frame rate. Different game resolutions are considered in our empirical research, and our results show that frame rate and bitrate increase across resolutions and with the use of the NVIDIA CUDA-enhanced GPU.
APA, Harvard, Vancouver, ISO, and other styles
9

Bourque, Donald. "CUDA-Accelerated ORB-SLAM for UAVs." Digital WPI, 2017. https://digitalcommons.wpi.edu/etd-theses/882.

Full text
Abstract:
"The use of cameras and computer vision algorithms to provide state estimation for robotic systems has become increasingly popular, particularly for small mobile robots and unmanned aerial vehicles (UAVs). These algorithms extract information from the camera images and perform simultaneous localization and mapping (SLAM) to provide state estimation for path planning, obstacle avoidance, or 3D reconstruction of the environment. High resolution cameras have become inexpensive and are a lightweight and smaller alternative to laser scanners. UAVs often have monocular camera or stereo camera setups since payload and size impose the greatest restrictions on their flight time and maneuverability. This thesis explores ORB-SLAM, a popular Visual SLAM method that is appropriate for UAVs. Visual SLAM is computationally expensive and normally offloaded to computers in research environments. However, large UAVs with greater payload capacity may carry the necessary hardware for performing the algorithms. The inclusion of general-purpose GPUs on many of the newer single board computers allows for the potential of GPU-accelerated computation within a small board profile. For this reason, an NVidia Jetson board containing an NVidia Pascal GPU was used. CUDA, NVidia’s parallel computing platform, was used to accelerate monocular ORB-SLAM, achieving onboard Visual SLAM on a small UAV. Committee members:"
APA, Harvard, Vancouver, ISO, and other styles
10

Subramoniapillai, Ajeetha Saktheesh. "Architectural Analysis and Performance Characterization of NVIDIA GPUs using Microbenchmarking." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1344623484.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Nejadfard, Kian. "Context-aware automated refactoring for unified memory allocation in NVIDIA CUDA programs." Cleveland State University / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=csu1624622944458295.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Zaahid, Mohammed. "Performance Metrics Analysis of GamingAnywhere with GPU accelerated NVIDIA CUDA using gVirtuS." Thesis, Blekinge Tekniska Högskola, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16852.

Full text
Abstract:
The modern world has opened the gates to many advancements in cloud computing, particularly in the field of cloud gaming. The most recent development in this area is the open-source cloud gaming system called GamingAnywhere. The relationship between the CPU and the GPU is the main focus of this thesis. Graphics Processing Unit (GPU) performance plays a vital role in the playing experience and in the enhancement of GamingAnywhere. This paper concentrates on the virtualization of the GPU and suggests that accelerating this unit with NVIDIA CUDA is the key to better performance when using GamingAnywhere. After extensive research, gVirtuS was chosen as the technique for virtualizing NVIDIA CUDA. An experimental study was conducted to evaluate the feasibility and performance of GPU solutions by VMware in the cloud gaming scenarios given by GamingAnywhere. Performance is measured in terms of bitrate, packet loss, jitter and frame rate. Different game resolutions are considered in our empirical research, and our results show that frame rate and bitrate increase across resolutions and with the use of the NVIDIA CUDA-enhanced GPU.
APA, Harvard, Vancouver, ISO, and other styles
13

Krivoklatský, Filip. "Návrh vestavaného systému inteligentného vidění na platformě NVIDIA." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-400627.

Full text
Abstract:
This diploma thesis deals with the design of an embedded computer vision system and the porting of an existing computer vision application for 3D object detection from Windows to the designed embedded system running Linux. The thesis focuses on the design of a communication interface for system control and for transferring the camera video over a local network with video compression. The detection algorithm is then enhanced by moving computationally expensive functions to the GPU using CUDA technology. Finally, a user application with a graphical interface is designed for controlling the system from Windows.
APA, Harvard, Vancouver, ISO, and other styles
14

Loundagin, Justin. "Optimizing Harris Corner Detection on GPGPUs Using CUDA." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1348.

Full text
Abstract:
The objective of this thesis is to optimize the Harris corner detection algorithm implementation on NVIDIA GPGPUs using the CUDA software platform and to measure the performance benefit. The Harris corner detection algorithm, developed by C. Harris and M. Stephens, discovers well-defined corner points within an image. The corner detection implementation has been proven to be computationally intensive, thus real-time performance is difficult with a sequential software implementation. This thesis decomposes the Harris corner detection algorithm into a set of parallel stages, each of which is implemented and optimized on the CUDA platform. The performance results show that by applying strategic CUDA optimizations to the Harris corner detection implementation, real-time performance is feasible. The optimized CUDA implementation of the Harris corner detection algorithm showed significant speedup over several platforms: standard C, MATLAB, and OpenCV. The optimized CUDA implementation was then applied to a feature-matching computer vision system, which showed significant speedup over the other platforms.
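One of the parallel stages lends itself to a one-thread-per-pixel kernel. The sketch below computes the Harris response from precomputed, window-smoothed gradient products; the array names and the assumption that smoothing happens in an earlier stage are illustrative guesses, not the thesis implementation.

```cuda
// Harris response R = det(M) - k * trace(M)^2, one thread per pixel.
// Ixx, Iyy, Ixy hold window-smoothed products of image gradients.
__global__ void harrisResponse(const float* Ixx, const float* Iyy, const float* Ixy,
                               float* R, int width, int height, float k)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    float a = Ixx[idx], b = Iyy[idx], c = Ixy[idx];
    float det   = a * b - c * c;
    float trace = a + b;
    R[idx] = det - k * trace * trace;
}
```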
APA, Harvard, Vancouver, ISO, and other styles
15

Shaker, Alfred M. "COMPARISON OF THE PERFORMANCE OF NVIDIA ACCELERATORS WITH SIMD AND ASSOCIATIVE PROCESSORS ON REAL-TIME APPLICATIONS." Kent State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=kent1501084051233453.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Araújo, João Manuel da Silva. "Paralelização de algoritmos de Filtragem baseados em XPATH/XML com recurso a GPUs." Master's thesis, FCT - UNL, 2009. http://hdl.handle.net/10362/2530.

Full text
Abstract:
Master's dissertation in Informatics Engineering
This dissertation studies the feasibility of using GPUs for parallel processing applied to notification-filtering algorithms in a publisher/subscriber system. To this end, experimental results were compared between the sequential version (on CPUs) and the parallel version of a filtering algorithm chosen as a reference. This analysis sought to provide elements for assessing whether the eventual gains from exploiting GPUs are enough to compensate for the greater complexity of the process.
APA, Harvard, Vancouver, ISO, and other styles
17

Shi, Bobo. "Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1462793739.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Čermák, Michal. "Detekce pohyblivého objektu ve videu na CUDA." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-236992.

Full text
Abstract:
This thesis deals with a model-based approach to 3D tracking from monocular video. The pose of the 3D model is dynamically estimated through minimization of an objective function by a particle filter. The objective function is based on the similarity between the rendered scene and the real video.
APA, Harvard, Vancouver, ISO, and other styles
19

Fuksa, Tomáš. "Paralelizace výpočtů pro zpracování obrazu." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2011. http://www.nusl.cz/ntk/nusl-219371.

Full text
Abstract:
This work deals with parallel computing on modern processors - multi-core CPUs and GPUs. The goal is to learn about computing on these devices, which are suitable for parallelization, to define their advantages and disadvantages, to test their properties on examples, and to select appropriate tools to implement a library for parallel image processing. This library is going to be used for vanishing-point estimation in a path-finding mobile robot.
APA, Harvard, Vancouver, ISO, and other styles
20

Mašek, Jan. "Dynamický částicový systém jako účinný nástroj pro statistické vzorkování." Doctoral thesis, Vysoké učení technické v Brně. Fakulta stavební, 2018. http://www.nusl.cz/ntk/nusl-390276.

Full text
Abstract:
The presented doctoral thesis aims at developing a new efficient tool for optimizing the uniformity of point samples. One use case for these point sets is as optimized sets of integration points in statistical analyses of computer models using Monte Carlo-type integration. It is well known that the pursuit of uniformly distributed sets of integration points is the only possible way of decreasing the error of estimating an integral over an unknown function. The work surveys currently used criteria for evaluating and/or optimizing the uniformity of point sets. A critical evaluation of their properties is presented, leading to suggestions for improving the spatial and statistical uniformity of the resulting samples. A refined variant of the general formulation of the phi optimization criterion has been derived by incorporating a periodically repeated design domain along with scale-independent behavior of the criterion. Based on a physical analogy between a set of sampling points and a dynamical system of mutually repelling particles, a hyper-dimensional N-body system has been selected as the driver of the developed optimization tool. Because simulating such a dynamical system is known to be a computationally intensive task, an efficient solution using the massively parallel GPGPU platform Nvidia CUDA has been developed. An intensive study of the properties of this complex architecture turned out to be necessary to fully exploit the possible speedup.
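The repelling-particle analogy maps naturally onto a brute-force N-body CUDA kernel. The sketch below is a simplified guess at such a driver for a 2D unit domain with periodic images; the force law, the epsilon regularization and all names are assumptions for illustration, not the thesis code.

```cuda
// Brute-force repulsion between sampling points in a periodic 2D unit cell.
// One thread accumulates the force acting on one point; O(N^2) overall.
__global__ void repulsionForces(const float2* pts, float2* force, int n, float eps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float2 f = make_float2(0.0f, 0.0f);
    float2 pi = pts[i];
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pi.x - pts[j].x;
        float dy = pi.y - pts[j].y;
        // minimum-image convention: periodically repeated design domain
        dx -= rintf(dx);
        dy -= rintf(dy);
        float r2 = dx * dx + dy * dy + eps;   // eps avoids division by zero
        float inv = rsqrtf(r2);
        float w = inv * inv * inv;            // ~1/r^2 repulsion magnitude
        f.x += dx * w;
        f.y += dy * w;
    }
    force[i] = f;
}
```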
APA, Harvard, Vancouver, ISO, and other styles
21

Senthil, Kumar Nithin. "Designing optimized MPI+NCCL hybrid collective communication routines for dense many-GPU clusters." The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619132252608831.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Bartosch, Nadine. "Correspondence-based pairwise depth estimation with parallel acceleration." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-34372.

Full text
Abstract:
This report covers the implementation and evaluation of a stereo-vision, correspondence-based depth estimation algorithm on a GPU. The results and feedback are used for a multi-view camera system combined with Jetson TK1 devices for parallelized image processing, whose aim is to estimate the depth of the scenery in front of it. The performance of the algorithm plays the key role. Alongside the implementation, the objective of this study is to investigate the advantages of parallel acceleration, inter alia the differences to execution on a CPU, which are significant for all the functions, the overheads particular to a GPU application such as memory transfer from the CPU to the GPU and vice versa, as well as the challenges of real-time and concurrent execution. The study has been conducted with the aid of CUDA on three NVIDIA GPUs with different characteristics, and with the aid of knowledge gained through an extensive literature study of different depth estimation algorithms, stereo vision and correspondence, as well as CUDA in general. Using the full set of components of the algorithm and expecting (near) real-time execution is utopian in this setup and implementation; the slowing factors include the semi-global matching. Investigating alternatives shows that disparity maps of a certain accuracy are also achieved by local methods such as the Hamming distance alone and by a filter that refines the results. Furthermore, it is demonstrated that the kernel launch configuration and the usage of GPU memory types such as shared memory are crucial for GPU implementations and have an impact on the performance of the algorithm. Concurrency proves to be a more complicated task, especially in the desired way of realization. For future work and refinement of the algorithm it is therefore recommended to invest more time into further optimization possibilities with regard to shared memory and into integrating the algorithm into the actual pipeline.
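The local Hamming-distance matching mentioned above maps well to a per-pixel CUDA kernel over census-transformed images. The sketch below illustrates this under assumed names, a 32-bit census signature and a fixed disparity range; it is not the thesis implementation.

```cuda
// Matching cost by Hamming distance between 32-bit census signatures.
// One thread computes the costs for all disparities of one left-image pixel.
__global__ void censusHammingCost(const unsigned int* censusL,
                                  const unsigned int* censusR,
                                  unsigned char* cost,       // [height][width][maxDisp]
                                  int width, int height, int maxDisp)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned int left = censusL[y * width + x];
    for (int d = 0; d < maxDisp; ++d) {
        int xr = x - d;
        unsigned int right = (xr >= 0) ? censusR[y * width + xr] : 0u;
        // __popc counts set bits, i.e. the Hamming distance of the signatures
        cost[(y * width + x) * maxDisp + d] = (unsigned char)__popc(left ^ right);
    }
}
```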
APA, Harvard, Vancouver, ISO, and other styles
23

Karri, Venkata Praveen. "Effective and Accelerated Informative Frame Filtering in Colonoscopy Videos Using Graphic Processing Units." Thesis, University of North Texas, 2010. https://digital.library.unt.edu/ark:/67531/metadc31536/.

Full text
Abstract:
Colonoscopy is an endoscopic technique that allows a physician to inspect the mucosa of the human colon. Previous methods and software solutions to detect informative frames in a colonoscopy video (a process called informative frame filtering or IFF) have been largely ineffective in (1) covering the proper definition of an informative frame in the broadest sense and (2) striking an optimal balance between accuracy and speed of classification in both real-time and non-real-time medical procedures. In my thesis, I propose a more effective method and faster software solutions for IFF. The method is more effective due to the introduction of a heuristic algorithm (derived from experimental analysis of typical colon features) for classification, which contributed a 5-10% boost in various performance metrics for IFF. The software modules are faster due to the incorporation of sophisticated parallel-processing-oriented coding techniques on modern microprocessors. Two IFF modules were created, one for post-procedure use and the other for real-time use. Code optimizations through NVIDIA CUDA for GPU processing and/or CPU multi-threading concepts embedded in two significant microprocessor design philosophies (multi-core design and many-core design) resulted in a 5-fold acceleration for the post-procedure module and a 40-fold acceleration for the real-time module. Some innovative software modules, which are still in the testing phase, have recently been created to exploit the power of multiple GPUs together.
APA, Harvard, Vancouver, ISO, and other styles
24

Ekstam, Ljusegren Hannes, and Hannes Jonsson. "Parallelizing Digital Signal Processing for GPU." Thesis, Linköpings universitet, Programvara och system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167189.

Full text
Abstract:
Because of the increasing importance of signal processing in today's society, there is a need to easily experiment with new ways to process signals. Usually, fast digital signal processing is done with special-purpose hardware that is difficult to develop for. GPUs pose an alternative for fast digital signal processing. The work in this thesis is an analysis and implementation of a GPU version of a digital signal processing chain provided by SAAB. Through an iterative process of development and testing, a final implementation was achieved. Two benchmarks, each comprising 4.2 M test samples, were made to compare the CPU implementation with the GPU implementation. The benchmarks were run on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2. The results show that the parallelized version can reach a throughput several orders of magnitude higher than the CPU implementation.
APA, Harvard, Vancouver, ISO, and other styles
25

Chehaimi, Omar. "Parallelizzazione dell'algoritmo di ricostruzione di Feldkamp-Davis-Kress per architetture Low-Power di tipo System-On-Chip." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/13918/.

Full text
Abstract:
In this thesis, carried out at CNAF, we present the results obtained in parallelizing in CUDA the Feldkamp-Davis-Kress (FDK) tomographic reconstruction algorithm, starting from the sequential and MPI-parallel versions of the software developed at the laboratories of the X-ray Imaging Group. The goals of this work are mainly two: to significantly reduce the execution time of the FDK reconstruction algorithm by parallelizing it on Graphics Processing Units (GPUs), and to evaluate energy consumption on different types of architectures. The platforms examined are low-power SoCs (System-on-Chip), architectures with low energy consumption but limited computing power, and High Performance Computing (HPC) systems, characterized by high computing power but substantial energy consumption. The aim is to highlight the difference in performance in relation to the type of architecture and to its energy consumption. Being able to replace HPC nodes with low-power SoC boards offers the advantages of reduced consumption, reduced hardware complexity, and the possibility of obtaining results directly on site. The results show that parallelizing FDK on the GPU is the most efficient choice: on every architecture tested it outperforms the MPI version, even though in the latter the whole algorithm is parallelized, whereas in CUDA only the reconstruction phase is. Moreover, a GPU utilization efficiency of 100% was reached. Energy efficiency relative to execution time is better for the SoC architectures than for the HPC ones. Finally, a hybrid approach combining MPI and CUDA is proposed, which further improves execution performance: filtering and reconstruction are independent operations, so the most efficient implementation is used for each one, filtering in MPI and reconstructing in CUDA.
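The reconstruction phase that is offloaded to CUDA is essentially a backprojection loop over voxels. The sketch below is a deliberately simplified, parallel-beam style illustration (one thread per pixel of a slice, nearest-neighbour sampling, hypothetical geometry arrays), not the FDK code described in the thesis, which also applies cone-beam weighting and interpolation.

```cuda
// Simplified backprojection: each thread accumulates one pixel of a single
// z-slice from all filtered projections. proj is laid out as [numAngles][detWidth].
__global__ void backprojectSlice(const float* proj, float* slice,
                                 const float* cosA, const float* sinA,
                                 int numAngles, int detWidth, int volDim)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= volDim || y >= volDim) return;

    // Voxel coordinates centred on the rotation axis.
    float vx = x - 0.5f * volDim;
    float vy = y - 0.5f * volDim;

    float acc = 0.0f;
    for (int a = 0; a < numAngles; ++a) {
        // Detector coordinate hit by this voxel at angle a (nearest neighbour).
        float t = vx * cosA[a] + vy * sinA[a] + 0.5f * detWidth;
        int u = (int)(t + 0.5f);
        if (u >= 0 && u < detWidth)
            acc += proj[a * detWidth + u];
    }
    slice[y * volDim + x] = acc;
}
```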
APA, Harvard, Vancouver, ISO, and other styles
26

Mintěl, Tomáš. "Interpolace obrazových bodů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-236736.

Full text
Abstract:
This master's thesis deals with the acceleration of pixel interpolation methods using the GPU and the NVIDIA CUDA architecture. The graphic output is represented by a demonstration application for geometric image transforms using a chosen interpolation method. Time-critical parts of the code are moved to the GPU and executed in parallel. Highly optimized routines from the OpenCV library, developed by Intel for image and video processing, are used.
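To give a flavour of such a GPU interpolation step, the sketch below performs bilinear sampling for a backward-mapped coordinate, one thread per output pixel. The mapping (a plain scaling) and all names are assumptions for illustration, not the thesis code.

```cuda
// Bilinear interpolation for image scaling: each thread produces one output
// pixel by backward-mapping into the source image and blending 4 neighbours.
__global__ void scaleBilinear(const unsigned char* src, int srcW, int srcH,
                              unsigned char* dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Backward mapping of the output pixel into source coordinates, clamped.
    float sx = fminf(fmaxf((x + 0.5f) * srcW / dstW - 0.5f, 0.0f), srcW - 1.0f);
    float sy = fminf(fmaxf((y + 0.5f) * srcH / dstH - 0.5f, 0.0f), srcH - 1.0f);

    int x0 = (int)floorf(sx), y0 = (int)floorf(sy);
    int x1 = min(srcW - 1, x0 + 1), y1 = min(srcH - 1, y0 + 1);
    float fx = sx - x0, fy = sy - y0;

    float top = src[y0 * srcW + x0] * (1.0f - fx) + src[y0 * srcW + x1] * fx;
    float bot = src[y1 * srcW + x0] * (1.0f - fx) + src[y1 * srcW + x1] * fx;
    dst[y * dstW + x] = (unsigned char)(top * (1.0f - fy) + bot * fy + 0.5f);
}
```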
APA, Harvard, Vancouver, ISO, and other styles
27

Němeček, Petr. "Geometrické transformace obrazu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-236764.

Full text
Abstract:
This master's thesis deals with the acceleration of geometric image transforms using the GPU and the NVIDIA CUDA architecture. Time-critical parts of the code are moved to the GPU and executed in parallel. One of the results is a demonstration application for performance comparison of both architectures: the CPU alone, and the GPU in combination with the CPU. Highly optimized routines from the OpenCV library, developed by Intel, are used as the reference implementation.
APA, Harvard, Vancouver, ISO, and other styles
28

Hordemann, Glen J. "Exploring High Performance SQL Databases with Graphics Processing Units." Bowling Green State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1380125703.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Music, Sani. "Grafikkort till parallella beräkningar." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20150.

Full text
Abstract:
This study describes how graphics cards can be used for general-purpose computing, which differs from the most common field in which graphics cards are used, multimedia. The study describes and discusses present-day alternatives for using graphics cards for general operations; in this study we use and describe the Nvidia CUDA architecture. The study describes how graphics cards can be used for general operations from the point of view that we already know how to program in some high-level language and have basic knowledge of how a computer works. We use accelerated libraries (THRUST and CUBLAS) on the graphics card to achieve our goals, which are software development and benchmarking. The results are programs addressing certain problems (matrix multiplication, sorting, binary search, vector inverting) and the execution times and speedups for these programs, where the graphics card is compared with the processor running serially and in parallel. The results show a speedup of up to approximately 50 times compared to serial implementations on the processor.
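For context, the accelerated libraries mentioned are used roughly as follows. This minimal sketch sorts on the GPU with Thrust, one of the benchmarked operations; the data size is chosen arbitrarily and the code is not the study's benchmark program.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    thrust::host_vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = rand();

    thrust::device_vector<int> d = h;   // copy the data to the GPU
    thrust::sort(d.begin(), d.end());   // sort runs on the device

    thrust::copy(d.begin(), d.end(), h.begin());
    printf("first: %d, last: %d\n", h[0], h[n - 1]);
    return 0;
}
```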
APA, Harvard, Vancouver, ISO, and other styles
30

Farabegoli, Nicolas. "Implementazione ottimizata dell'operatore di Dirac su GPGPU." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20356/.

Full text
Abstract:
In Lattice QCD applications the Dirac operator is one of the main operations, and optimizing its efficiency translates into an increase in the overall performance of the algorithm. In this respect, Tensor Cores are a solution that increases the performance of the Dirac operator computation, in particular by optimizing the multiplication between matrices and vectors. The Tensor Core architecture was analyzed in detail, studying its execution model and memory layout. Several solutions that exploit Tensor Cores to accelerate the Dirac operator were then formulated and analyzed in detail.
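Tensor Cores are programmed through the WMMA API sketched below: a generic 16x16x16 half-precision tile multiply-accumulate, assuming a GPU of compute capability 7.0 or newer and a launch of at least one full warp. It is not one of the Dirac-operator kernels developed in the thesis.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 half-precision tile pair and accumulates in float.
__global__ void wmmaTile(const half* A, const half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, A, 16);   // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);   // executed on the Tensor Cores
    wmma::store_matrix_sync(C, cFrag, 16, wmma::mem_row_major);
}
```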
APA, Harvard, Vancouver, ISO, and other styles
31

Macenauer, Pavel. "Detekce objektů na GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2015. http://www.nusl.cz/ntk/nusl-234942.

Full text
Abstract:
This thesis addresses the topic of object detection on graphics processing units. As a part of it, a system for object detection using NVIDIA CUDA was designed and implemented, allowing for real-time video object detection and bulk processing. Its contribution is mainly a study of the options that NVIDIA CUDA technology and current graphics processing units offer for accelerating object detection. Parallel algorithms for object detection are also discussed and suggested.
APA, Harvard, Vancouver, ISO, and other styles
32

Straňák, Marek. "Raytracing na GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-237020.

Full text
Abstract:
Raytracing is a basic technique for displaying 3D objects. The goal of this thesis is to demonstrate the possibility of implementing a raytracer using a programmable GPU. The algorithm and its modified version, implemented using the "C for CUDA" language, are described. The raytracer is focused on displaying dynamic scenes. For this purpose the KD-tree structure, bounding volume hierarchies and PBO transfers are used. To achieve realistic output, photon mapping was implemented.
APA, Harvard, Vancouver, ISO, and other styles
33

Artico, Fausto. "Performance Optimization Of GPU ELF-Codes." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3424532.

Full text
Abstract:
GPUs (Graphics Processing Units) are of interest for their favorable $\frac{GF/s}{price}$ ratio. Compared to their beginnings in the early 1980's, today's GPU architectures are more similar to general-purpose architectures but with (much) larger numbers of cores - the GF100 architecture released by NVIDIA in 2009-2010, for example, has a true hardware cache hierarchy, a unified memory address space, double-precision performance and a maximum of 512 cores. Exploiting the computational power of GPUs for non-graphics applications - past or present - has, however, always been hard. Initially, in the early 2000's, the only way to program GPUs was through graphics library APIs, which made writing non-graphics codes non-trivial and tedious at best, and virtually impossible in the worst case. In 2003, the Brook compiler and runtime system was introduced, giving users the ability to generate GPU code from a high-level programming language. In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture). CUDA, a parallel computing platform and programming model specifically developed by NVIDIA for its GPUs, attempts to further facilitate general-purpose programming of GPUs. Code written using CUDA is portable between different NVIDIA GPU architectures, and this is one of the reasons why NVIDIA claims that the user's productivity is much higher than with previous solutions. However, optimizing GPU code for utmost performance remains very hard, especially for NVIDIA GPUs using the GF100 architecture - e.g., Fermi GPUs and some Tesla GPUs - because a) the real instruction set architecture (ISA) is not publicly available, b) the code of the NVIDIA compiler - nvcc - is not open and c) users cannot edit code using the real assembly - ELF in NVIDIA parlance. Compilers, while enabling immense increases in programmer productivity by eliminating the need to code at the (tedious) assembly level, are incapable of achieving, to date, performance similar to that of an expert assembly programmer with good knowledge of the underlying architecture. In fact, it is widely accepted that high-level language programming and compiling, even with state-of-the-art compilers, lose on average a factor of 3 in performance - and sometimes much more - over what a good assembly programmer could achieve, and that even on a conventional, simple, single-core machine. Compilers for more complex machines, such as NVIDIA GPUs, are likely to do much worse because, among other things, they face (even more) complex trade-offs between often undecidable and NP-hard problems. However, because NVIDIA a) makes it virtually impossible to gain access to the actual assembly language used by its GF100 architecture, b) does not publicly explain many of the internal mechanisms implemented in its compiler - nvcc - and c) makes it virtually impossible to learn the details of its very complex GF100 architecture in sufficient depth to be able to exploit them, obtaining an estimate of the performance difference between CUDA programming and machine-level programming for NVIDIA GPUs using the GF100 architecture - let alone achieving a priori guarantees of shortest execution time - has been, prior to this work, impossible. To optimize GPU code, users have to use CUDA or PTX (Parallel Thread Execution) - a virtual instruction set architecture. The CUDA or PTX files are given as input to nvcc, which produces fatbin files as output.
The fatbin files are produced for the target GPU architecture selected by the user - this is done by setting a flag used by nvcc. In a fatbin file, zero or more parts will be executed by the CPU - think of these parts as the C/C++ parts - while the remaining parts - think of these as the ELF parts - will be executed by the specific GPU model for which the CUDA or PTX file has been compiled. The fatbin files are usually very different from the corresponding CUDA or PTX files, and this lack of control can completely ruin any effort made at the CUDA or PTX level to optimize the ELF part or parts of the fatbin file that will be executed by the target GPU for which the fatbin file has been compiled. We therefore reverse engineer the real ISA used by the GF100 architecture and generate a set of editing guidelines to force nvcc to generate fatbin files with at least the minimum number of resources that are later necessary to modify them so as to obtain the wanted ELF algorithmic implementations - this gives control over the ELF code that is executed by any GPU using the GF100 architecture. During the process of reverse engineering we also discover all the correspondences between PTX instructions and ELF instructions - a single PTX instruction can be transformed into one or more ELF instructions - and the correspondences between PTX registers and ELF registers. Our procedure is completely repeatable for any NVIDIA Kepler GPU - we do not need to rewrite our code. Being able to obtain the wanted ELF algorithmic implementations is not enough to optimize the ELF code of a fatbin file; we also need to discover, understand, and quantify some undisclosed GPU behaviors that could slow down the execution of ELF code. This is necessary to understand how to carry out the optimization process, and while we cannot report here all the results we have obtained, we can say that we explain to the reader a) how to force even distribution of the GPU thread blocks to the streaming multiprocessors, b) how we discovered and quantified several warp-scheduling phenomena, c) how to avoid phenomena of warp-scheduling load unbalancing, which cannot be controlled, in the streaming multiprocessors, d) how we determined, for each ELF instruction, the minimum amount of time that must elapse before a warp scheduler can schedule a warp again - yes, this amount of time can differ between ELF instructions - e) how we determined the time that must elapse before the data in a register previously read or written can be read again - this too can differ between ELF instructions and depending on whether the data was previously read or written - and f) how we discovered the presence of an overhead time for the management of the warps that does not grow linearly with a linear increase in the number of resident warps in a streaming multiprocessor.
Next we explain a) the transformation procedures that must be applied to the ELF code of a fatbin file to optimize it and make its execution time as short as possible, b) why we need to classify the fatbin files generated from the original fatbin file during the optimization process, and how we do this using several criteria whose final result is to determine the position occupied by each generated fatbin file in a taxonomy that we have created, c) how, using the position of a fatbin file in the taxonomy, we determine whether the fatbin file is eligible for an empirical analysis - which we explain - a theoretical analysis, or both, and d) how - if the fatbin file is eligible for a theoretical analysis - we carry out the theoretical analysis we have devised and give an a priori guarantee - without any previous execution of the fatbin file, and provided the fatbin file satisfies all the requirements of the theoretical analysis - of the shortest ELF code execution time for the ELF code that will be executed by the target GPU for which the fatbin file has been compiled.
APA, Harvard, Vancouver, ISO, and other styles
34

Adeboye, Taiyelolu. "Robot Goalkeeper : A robotic goalkeeper based on machine vision and motor control." Thesis, Högskolan i Gävle, Avdelningen för elektronik, matematik och naturvetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-27561.

Full text
Abstract:
This report shows a robust and efficient implementation of a speed-optimized algorithm for object recognition, 3D real-world location and tracking in real time. It details a design that was focused on detecting and following objects in flight, as applied to a football in motion. An overall goal of the design was to develop a system capable of recognizing an object and its present and near-future location, while also actuating a robotic arm in response to the motion of the ball in flight. The implementation made use of image processing functions in C++, an NVIDIA Jetson TX1, and Stereolabs' ZED stereoscopic camera setup connected to an embedded system controller for the robot arm. The image processing was done against a textured background, and the 3D location coordinates were used to correct a Kalman filter model that estimated and predicted the ball location. A capture and processing speed of 59.4 frames per second was obtained with good accuracy in depth detection, while the ball was well tracked in the tests carried out.
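A predict-correct loop of the kind described can be sketched with OpenCV's Kalman filter. This is only an illustrative constant-velocity setup with made-up noise covariances and a fabricated measurement, not the filter tuned in the thesis.

```cuda
#include <opencv2/video/tracking.hpp>
#include <cstdio>

int main()
{
    // 6 states (x, y, z, vx, vy, vz), 3 measurements (x, y, z) from the stereo camera.
    cv::KalmanFilter kf(6, 3, 0, CV_32F);
    float dt = 1.0f / 59.4f;                       // frame period reported above

    cv::setIdentity(kf.transitionMatrix);          // constant-velocity motion model
    for (int i = 0; i < 3; ++i)
        kf.transitionMatrix.at<float>(i, i + 3) = dt;

    for (int i = 0; i < 3; ++i)                    // measure position only
        kf.measurementMatrix.at<float>(i, i) = 1.0f;

    cv::setIdentity(kf.processNoiseCov,     cv::Scalar::all(1e-3));  // assumed values
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-2));

    // Per frame: predict the ball position, then correct with the 3D measurement.
    cv::Mat prediction = kf.predict();
    cv::Mat meas = (cv::Mat_<float>(3, 1) << 0.1f, 0.2f, 2.5f);      // hypothetical position
    kf.correct(meas);
    printf("predicted x: %f\n", prediction.at<float>(0));
    return 0;
}
```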
APA, Harvard, Vancouver, ISO, and other styles
35

Rek, Václav. "Využití paralelizace při numerickém řešení úloh nelineární dynamiky." Doctoral thesis, Vysoké učení technické v Brně. Fakulta stavební, 2018. http://www.nusl.cz/ntk/nusl-392279.

Full text
Abstract:
The main aim of this thesis is the exploration of the potential use of parallelism in numerical computations in the field of nonlinear dynamics. The last decade has seen the dramatic onset of multicore and multiprocessor systems, combined with the possibilities that modern computer networks now provide. The complexity and size of the investigated models are constantly increasing due to the high computational complexity of computational tasks in the dynamics and statics of structures, mainly because of the nonlinear character of the solved models. Any possibility of speeding up such calculation procedures is more than desirable. This is a relatively new branch of science, so specific algorithms and parallel implementations are still in the stage of research and development, which is attributed to the latest advances in computer hardware, which is evolving rapidly. More and more questions are raised about how best to utilize the available computing power. The proposed parallel model is based on the explicit form of the finite element method, which naturally provides the possibility of efficient parallelization. The possibilities of multicore processors are investigated, as well as a hybrid parallel model combining the possibilities of multicore processors with parallelism across a computer network. The designed approaches are then examined in numerical analyses of contact/impact phenomena in shell structures.
APA, Harvard, Vancouver, ISO, and other styles
36

Cazalas, Jonathan M. "Efficient and Scalable Evaluation of Continuous, Spatio-temporal Queries in Mobile Computing Environments." Doctoral diss., University of Central Florida, 2012. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5154.

Full text
Abstract:
A variety of research exists for the processing of continuous queries in large, mobile environments. Each method tries, in its own way, to address the computational bottleneck of constantly processing so many queries. For this research, we present a two-pronged approach to addressing this problem. Firstly, we introduce an efficient and scalable system for monitoring traditional, continuous queries by leveraging the parallel processing capability of the Graphics Processing Unit. We examine a naive CPU-based solution for continuous range-monitoring queries, and we then extend this system using the GPU. Additionally, with mobile communication devices becoming a commodity, location-based services will become ubiquitous. To cope with the very high intensity of location-based queries, we propose a view-oriented approach to the location database, thereby reducing computation costs by exploiting computation sharing amongst queries requiring the same view. Our studies show that by exploiting the parallel processing power of the GPU, we are able to significantly scale the number of mobile objects, while maintaining an acceptable level of performance. Our second approach was to view this research problem as one belonging to the domain of data streams. Several works have convincingly argued that the two research fields of spatio-temporal data streams and the management of moving objects can naturally come together. [IlMI10, ChFr03, MoXA04] For example, the output of a GPS receiver, monitoring the position of a mobile object, is viewed as a data stream of location updates. This data stream of location updates, along with those from the plausibly many other mobile objects, is received at a centralized server, which processes the streams upon arrival, effectively updating the answers to the currently active queries in real time. For this second approach, we present GEDS, a scalable, Graphics Processing Unit (GPU)-based framework for the evaluation of continuous spatio-temporal queries over spatio-temporal data streams. Specifically, GEDS employs the computation sharing and parallel processing paradigms to deliver scalability in the evaluation of continuous, spatio-temporal range queries and continuous, spatio-temporal kNN queries. The GEDS framework utilizes the parallel processing capability of the GPU, a stream processor by trade, to handle the computation required in this application. Experimental evaluation shows promising performance and shows the scalability and efficacy of GEDS in spatio-temporal data streaming environments. Additional performance studies demonstrate that, even in light of the costs associated with memory transfers, the parallel processing power provided by GEDS clearly counters and outweighs any associated costs. Finally, in an effort to move beyond the analysis of specific algorithms over the GEDS framework, we take a broader approach in our analysis of GPU computing. What algorithms are appropriate for the GPU? What types of applications can benefit from the parallel and stream processing power of the GPU? And can we identify a class of algorithms that are best suited for GPU computing? To answer these questions, we develop an abstract performance model, detailing the relationship between the CPU and the GPU.
From this model, we are able to extrapolate a list of attributes common to successful GPU-based applications, thereby providing insight into which algorithms and applications are best suited for the GPU and also providing an estimated theoretical speedup for said GPU-based applications.
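
A minimal sketch of the kind of GPU-parallel range monitoring described above (assumed object and query layouts; not the GEDS code itself) assigns one thread to each (query, object) pair:

// Illustrative sketch: each thread tests one moving object against one
// continuous range query and flags it if it falls inside.
struct Query { float cx, cy, radius; };

__global__ void range_monitor(const float *obj_x, const float *obj_y, int n_objects,
                              const Query *queries, int n_queries,
                              unsigned char *inside /* n_queries * n_objects flags */)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_objects * n_queries) return;
    int q = tid / n_objects;        // which query this thread serves
    int o = tid % n_objects;        // which object it tests
    float dx = obj_x[o] - queries[q].cx;
    float dy = obj_y[o] - queries[q].cy;
    float r  = queries[q].radius;
    inside[tid] = (dx * dx + dy * dy <= r * r);  // squared distance avoids sqrtf
}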
ID: 031001567; System requirements: World Wide Web browser and PDF reader.; Mode of access: World Wide Web.; Title from PDF title page (viewed August 26, 2013).; Thesis (Ph.D.)--University of Central Florida, 2012.; Includes bibliographical references (p. 103-112).
Ph.D.
Doctorate
Computer Science
Engineering and Computer Science
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
37

Pecháček, Václav. "Akcelerace heuristických metod diskrétní optimalizace na GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236550.

Full text
Abstract:
This thesis deals with discrete optimization problems. It focuses on faster ways to find good solutions by means of heuristics and parallel processing. Based on the ant colony optimization (ACO) algorithm coupled with a k-optimization local search approach, it aims at massively parallel computing on graphics processors provided by the Nvidia CUDA platform. The well-known travelling salesman problem (TSP) is used as a case study. The solution is based on dividing the task into subproblems using tour-based partitioning, processing the distinct parts in parallel, and recombining them afterwards. The provided parallel code can perform the computation more than seventeen times faster than the sequential version.
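
As an illustration of the local-search half of this approach (a hedged sketch, not the thesis's implementation), the gain of every candidate 2-opt move on a tour can be evaluated by an independent thread, which is what makes k-optimization attractive on CUDA; the distance function and the flattened pair indexing are assumptions made for the example.

// Sketch: evaluate 2-opt move gains in parallel, one thread per (i, j) edge pair.
// The closing edge of the tour is ignored here for brevity.
__device__ float dist(const float *x, const float *y, int a, int b)
{
    float dx = x[a] - x[b], dy = y[a] - y[b];
    return sqrtf(dx * dx + dy * dy);
}

__global__ void two_opt_gain(const int *tour, const float *x, const float *y,
                             int n, float *gain /* n*n entries, row i, column j */)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 >= j || j + 1 >= n) return;            // need two disjoint edges (i,i+1) and (j,j+1)
    int a = tour[i], b = tour[i + 1], c = tour[j], d = tour[j + 1];
    // Positive gain means reversing the segment between b and c shortens the tour.
    gain[i * n + j] = (dist(x, y, a, b) + dist(x, y, c, d))
                    - (dist(x, y, a, c) + dist(x, y, b, d));
}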
APA, Harvard, Vancouver, ISO, and other styles
38

Maurer, Andreas. "Methods for Multisensory Detection of Light Phenomena on the Moon as a Payload Concept for a Nanosatellite Mission." Thesis, Luleå tekniska universitet, Rymdteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-80785.

Full text
Abstract:
For 500 years, transient light phenomena (TLP) have been observed on the lunar surface by ground-based observers. The actual physical cause of most of these events is still unknown today. Current plans by NASA and SpaceX to send astronauts back to the Moon, together with already successful deep-space CubeSat missions, will enable future research nanosatellite missions to cislunar space. This thesis presents a new hardware and software concept for a future payload on such a nanosatellite. The main task was to develop and implement a high-performance image processing algorithm whose task is to detect short brightening flashes on the lunar surface. Possible reference scenarios were analyzed based on a review of historically reported phenomena, possible explanatory theories for these phenomena, and currently active and planned ground- or space-based observatories. From the presented scenarios, one - the detection of brightening events - was chosen and requirements for this scenario were stated. Afterwards, possible detectors, processing computers and image processing algorithms were researched and compared against the specified requirements. This analysis of available algorithms was used to develop a new high-performance detection algorithm for transient brightening events on the Moon. The implementation of this algorithm, running on the processor and the internal GPU of a Mac Mini, achieved a frame rate of 55 FPS while processing images with a resolution of 4.2 megapixels. Its functionality and performance were verified on the remote telescope operated by the Chair of Space Technology of the University of Würzburg. Furthermore, the developed algorithm was successfully ported to the Nvidia Jetson Nano and its performance compared with an FPGA-based image processing algorithm. The results were used to choose an FPGA as the main processing computer of the payload. The concept uses two backside-illuminated CMOS image sensors connected to a single FPGA, on which the developed image processing algorithm is to be implemented. Further work is required to realize the proposed concept by building the actual hardware and porting the developed algorithm onto this platform.
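
As a simplified illustration of the detection idea (a sketch under assumed parameters, not the thesis's algorithm), a brightening event can be flagged per pixel by comparing the current frame against a running background estimate:

// Sketch: per-pixel brightening test against a running mean background.
// Threshold and learning rate are assumed example values.
__global__ void detect_brightening(const unsigned char *frame, float *background,
                                   unsigned char *mask, int n_pixels,
                                   float threshold, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pixels) return;
    float diff = (float)frame[i] - background[i];
    mask[i] = diff > threshold;                                   // candidate transient flash
    background[i] += alpha * ((float)frame[i] - background[i]);   // slow background update
}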
APA, Harvard, Vancouver, ISO, and other styles
39

Pospíchal, Petr. "Akcelerace genetického algoritmu s využitím GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-236783.

Full text
Abstract:
This master's thesis focuses on the acceleration of genetic algorithms using the GPU. The first chapter analyses genetic algorithms in depth, together with related topics such as population, chromosome, crossover, mutation and selection. The next part of the thesis shows GPU capabilities for unified computing, using both DirectX/OpenGL with Cg and specialized GPGPU libraries like CUDA. The fourth chapter focuses on the design of the GPU implementation using CUDA; coarse-grained and fine-grained GAs are discussed, complemented by GPU-accelerated sorting and random number generation. The next chapter covers implementation details - migration, crossover and selection schemes mapped onto the CUDA software model. All GA elements and the quality of the GPU results are described in the last chapter.
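
As a sketch of how a single GA operator maps onto the CUDA software model (illustrative only; the population layout and mutation rate are assumptions, not the thesis's code), per-thread cuRAND states let every gene be mutated independently:

// Sketch: bit-flip mutation, one thread per gene, using per-thread cuRAND states.
#include <curand_kernel.h>

__global__ void init_rng(curandState *states, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &states[i]);
}

__global__ void mutate(unsigned char *population, int n_genes_total,
                       float mutation_rate, curandState *states)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_genes_total) return;
    curandState local = states[i];
    if (curand_uniform(&local) < mutation_rate)
        population[i] ^= 1;          // flip the bit encoded in this gene
    states[i] = local;               // store the advanced RNG state back
}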
APA, Harvard, Vancouver, ISO, and other styles
40

Venkatasubramanian, Sundaresan. "Tuned and asynchronous stencil kernels for CPU/GPU systems." Thesis, Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/29728.

Full text
Abstract:
Thesis (M. S.)--Computing, Georgia Institute of Technology, 2009.
Committee Chair: Vuduc, Richard; Committee Member: Kim, Hyesoon; Committee Member: Vetter, Jeffrey. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
41

Lu, Chih-Te, and 盧志德. "Multiview Encoder Parallelized Fast Search Realization on NVIDIA CUDA." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/vdd5k4.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
Graduate Institute of Computer Science and Information Engineering
98
Due to the rapid growth of graphics processing unit (GPU) processing capability, it has become more and more popular to use it for non-graphics computations. NVIDIA announced a powerful GPU architecture called Compute Unified Device Architecture (CUDA) in 2007, which is able to provide massive data parallelism under the SIMD architecture constraint. We use the NVIDIA GTX-280 GPU system, which has 240 computing cores, as the platform to implement a very complicated video coding scheme. The Multiview Video Coding (MVC) scheme, an extension of H.264/AVC/MPEG-4 Part 10 (AVC), is being developed by the international standard team formed by the ITU-T Video Coding Experts Group and the ISO/IEC JTC 1 Moving Pictures Experts Group (MPEG). It is an efficient video compression scheme; however, its computational complexity is very high. Two of its most time-consuming components are motion estimation (ME) and disparity estimation (DE). In this thesis, we propose a fast search algorithm called multithreaded one-dimensional search (MODS). It can be used for both the ME and the DE operations. We implement the integer-pel ME and DE processes with MODS on the GTX-280 platform. The speedup ratio can reach 89 times over the CPU-only configuration. Even when the fast search algorithm of the original JMVC is turned on, the MODS version on CUDA is still 21 times faster.
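
To make the cost computation behind both ME and DE concrete (a hedged sketch of the common SAD cost, not the MODS search itself), every candidate displacement can be scored by an independent thread; the frame layout and candidate list are assumptions made for the example.

// Sketch: one thread scores one candidate displacement by 16x16 SAD.
// Row-major luma frames are assumed; bounds checks are omitted for brevity.
__global__ void sad_candidates(const unsigned char *cur, const unsigned char *ref,
                               int width, int mb_x, int mb_y,
                               const int2 *candidates, int n_cand, unsigned int *cost)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n_cand) return;
    int rx = mb_x + candidates[c].x;     // candidate displacement: motion or disparity vector
    int ry = mb_y + candidates[c].y;
    unsigned int sad = 0;
    for (int j = 0; j < 16; ++j)
        for (int i = 0; i < 16; ++i)
            sad += abs((int)cur[(mb_y + j) * width + mb_x + i]
                     - (int)ref[(ry + j) * width + rx + i]);
    cost[c] = sad;                       // the host (or a reduction kernel) picks the minimum
}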
APA, Harvard, Vancouver, ISO, and other styles
42

Chen, Wei-Nien, and 陳威年. "H.264/AVC Encoder Parallelized Realization on NVIDIA CUDA." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/44801582751450067743.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Department of Electronics Engineering
96
Due to the rapid growth of graphics processing unit (GPU) processing capability, using the GPU as a coprocessor to assist the central processing unit (CPU) in computing massive data has become essential. NVIDIA announced a powerful GPU architecture called Compute Unified Device Architecture (CUDA) in 2007. This new architecture largely improves the programming flexibility of general-purpose GPUs. In this thesis, we propose a highly parallel intra mode selection scheme and a full search motion estimation scheme with fractional-pixel refinement, optimized for the CUDA architecture. In order to achieve block-level parallelized intra mode selection, the original pixel values rather than the coded pixels are used for deciding the best intra-prediction mode. In addition, to fully utilize the computational power of CUDA, the thread usage and memory access pattern are carefully tuned. Following the parallel processing optimization rules, we design a motion estimation algorithm consisting of 5 stages. We try to process as much data as possible to fully use the computing power of this GPU. The proposed algorithms are evaluated on the NVIDIA GeForce 8800GTX GPU platform. The speed-up ratios of these two modules are about 12 times, and the overall H.264/AVC encoding is about 5 times faster than the PC-only counterpart.
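
As a small illustration of the key idea that enables block-level parallelism here (a sketch, not the thesis's code), choosing the best intra mode from original rather than reconstructed neighbours means every 4x4 block's cost can be computed without waiting for its neighbours to be coded; border handling is omitted.

// Sketch: cost of the DC intra mode for a 4x4 block, using ORIGINAL neighbour
// pixels so that all blocks can be evaluated in parallel.
__device__ unsigned int dc_mode_cost(const unsigned char *orig, int width, int bx, int by)
{
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        sum += orig[(by - 1) * width + bx + i];   // top neighbours (original, not reconstructed)
        sum += orig[(by + i) * width + bx - 1];   // left neighbours
    }
    int dc = (sum + 4) >> 3;                      // rounded mean of the 8 neighbours
    unsigned int sad = 0;
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            sad += abs((int)orig[(by + j) * width + bx + i] - dc);
    return sad;                                   // compared against the other modes' costs
}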
APA, Harvard, Vancouver, ISO, and other styles
43

Lai, Chen-Yen, and 賴辰彥. "H.264/AVC-SVC Encoder Parallelized Realization on NVIDIA CUDA." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/56388772998360120538.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Tsai, Sung-Han, and 蔡松翰. "Optimization for sparse matrix-vector multiplication based on NVIDIA CUDA platform." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/qw23p7.

Full text
Abstract:
Master's thesis
National Changhua University of Education
Department of Computer Science and Information Engineering
105
In recent years, large sparse matrices have been widely used in fields such as science and engineering, typically when solving linear models. Storing sparse matrices in the ELLPACK format can reduce the matrix storage space, but if any row of the original sparse matrix contains too many nonzero elements, it still wastes too much memory. Much research has focused on sparse matrix-vector multiplication (SpMV) with the ELLPACK format on the graphics processing unit (GPU). The purpose of our research is therefore to reduce the accessed data of a sparse matrix stored in Compressed Sparse Row (CSR) format after applying the Reverse Cuthill-McKee (RCM) algorithm, in order to accelerate SpMV on the GPU. Due to the low ratio of computation to data access in SpMV, performance is restricted by memory bandwidth. Our proposal is based on the CSR format and addresses two aspects: (1) reducing cache misses to enhance vector locality and raise performance, and (2) reducing the accessed matrix data through index reduction to optimize performance.
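
For reference, a baseline scalar CSR SpMV kernel (the common one-thread-per-row formulation, not necessarily the exact kernel optimized in the thesis) looks as follows; the RCM reordering and index reduction described above would be applied to the inputs before such a kernel runs.

// Baseline CSR SpMV: one thread per row, computing y = A * x.
// row_ptr has n_rows + 1 entries; col_idx and values hold the nonzeros.
__global__ void spmv_csr(const int *row_ptr, const int *col_idx, const float *values,
                         const float *x, float *y, int n_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float sum = 0.0f;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += values[k] * x[col_idx[k]];   // x accesses are where RCM-improved locality helps
    y[row] = sum;
}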
APA, Harvard, Vancouver, ISO, and other styles
45

Wang, Chun Hung, and 王鈞弘. "A new method for accelerating compound comparison based on NVIDIA CUDA." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/79j237.

Full text
Abstract:
Master's thesis
Chang Gung University
Department of Computer Science and Information Engineering
104
Computer-Aided Drug Design (CADD) has become an emerging research field, and it helps improve the efficiency of drug design and development. Based on theoretical computation, CADD predicts physico-chemical properties and visualizes the 3D structure of a protein, a ligand, or their complexes using computer graphics. In addition, CADD can compute the energy variations between them, or study the pharmacophore of a ligand, in order to design new drugs. However, computer-aided drug design for a target protein takes several weeks or months because of the complex computations, which limits development. In recent years, the computational capability of graphics processing units (GPUs) has been greatly improved by the industry. More and more research fields and applications have tried to use GPUs to speed up computations or extend experimental scales. Applying GPUs to computer-aided drug design is therefore a very important issue. Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmacological experiments. The time complexity of a pairwise compound comparison is O(n^2), where n is the maximal length of the compounds. In general, the length of compounds ranges from tens to hundreds, so the computation time is small. However, more and more compounds have been synthesized and extracted, even more than tens of millions, so comparing against a large number of compounds (the multiple compound comparison problem, abbreviated MCC) is still time-consuming. The intrinsic time complexity of the MCC problem is O(k^2 n^2) for k compounds of length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on multiple GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation among thread blocks on the GPUs. CUDA-MCC was implemented in C with OpenMP and CUDA, and in our experiments it runs faster than its CPU version on an NVIDIA Tesla K20c card, an NVIDIA Tesla K40c card, and an NVIDIA Tegra K1, respectively.
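
As a rough illustration of the pairwise similarity underlying such comparisons (a host-side sketch using the commonly cited LINGO-profile Tanimoto; whether CUDA-MCC uses exactly this form is an assumption), two compounds can be compared through their LINGO count profiles:

// Sketch of a LINGO-profile similarity between two compounds, computed on the host.
// The formula averages 1 - |Na - Nb| / (Na + Nb) over all LINGOs appearing in either
// compound; treating this as the measure used by CUDA-MCC is an assumption.
#include <cmath>
#include <map>
#include <string>

double lingo_similarity(const std::map<std::string, int> &a,
                        const std::map<std::string, int> &b)
{
    std::map<std::string, int> all(a);              // union of LINGO keys (missing => count 0)
    for (const auto &kv : b) all.insert({kv.first, 0});
    double sum = 0.0;
    for (const auto &kv : all) {
        auto ia = a.find(kv.first), ib = b.find(kv.first);
        double na = (ia != a.end()) ? ia->second : 0.0;
        double nb = (ib != b.end()) ? ib->second : 0.0;
        sum += 1.0 - std::fabs(na - nb) / (na + nb);   // na + nb > 0 for every key in the union
    }
    return all.empty() ? 0.0 : sum / all.size();
}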
APA, Harvard, Vancouver, ISO, and other styles
46

Yang, Fu-Kai, and 楊復凱. "Acceleration and Improvement of MPEG View Synthesis Reference Software on NVIDIA CUDA." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/62819489159539479165.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Institute of Electronics
100
With the prosperity of 3D technology, Free Viewpoint Television (FTV) has become a popular research topic. "View Synthesis" is a key step in FTV, and there are important open issues such as real-time operation and complexity reduction. NVIDIA Compute Unified Device Architecture (CUDA) is an effective platform for handling data-intensive applications. To implement the MPEG view synthesis reference software (VSRS) on CUDA, we parallelize the VSRS structure, and our proposed parallel scheme also improves the picture quality. We first propose an intra hole filling scheme to replace the original median filter. Then, to avoid data dependence, we partition the data appropriately so that they can be processed by parallel GPU threads. We also rearrange the data processing order in the threads to reduce branching instructions. Combining these techniques, we save more than 94% of the computing time and achieve similar image quality.
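
As a simplified sketch of what an intra (within-view) hole-filling step can look like (the hole mask, depth convention and background heuristic are assumptions; this is not the exact scheme proposed in the thesis), each disoccluded pixel is filled from the background side of its row:

// Sketch: fill each disoccluded pixel from the nearer of its two horizontal
// neighbours that is valid, preferring the background (farther) side.
// Depth convention assumed: smaller value = farther away. Border handling simplified.
__global__ void fill_holes(unsigned char *image, const unsigned char *hole_mask,
                           const unsigned char *depth, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height || !hole_mask[y * width + x]) return;

    int left = x, right = x;
    while (left  > 0         && hole_mask[y * width + left])  --left;   // nearest valid pixel on the left
    while (right < width - 1 && hole_mask[y * width + right]) ++right;  // nearest valid pixel on the right
    int src = (depth[y * width + left] <= depth[y * width + right]) ? left : right;
    image[y * width + x] = image[y * width + src];   // copy from the background side
}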
APA, Harvard, Vancouver, ISO, and other styles
47

Chi, Ping-Lin, and 机炳霖. "Simulation of Optical Properties for Thin Film Using CUDA on NVIDIA GPUs." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/23775928600592540874.

Full text
Abstract:
Master's thesis
National Kaohsiung First University of Science and Technology
Graduate Institute of Electro-Optical Engineering
99
Firstly, the thesis discusses the difference in parallel computing between multi-threaded and single-threaded data arrangements, and then measures whether the GPU can increase efficiency while maintaining confidence in the accuracy. Compared with an Intel i7 series CPU, an NVIDIA G100 series GPU increases efficiency by more than 40 times, and the relative difference is less than 10E-15. In other words, the GPU can replace the CPU for very large calculations. Compared with a simulation program developed on the Matlab 2008 platform, the efficiency increases by up to 200 times. In this program, two optimizations were applied: effective use of cache memory and of PCI-E bandwidth. The calculation method for improving the simulation program and the solution for many-core processors and multiple GPUs are also discussed. As for functionality, the program calculates the transmittance and reflectance of multilayer selective absorbing films, the absorptance of sunlight, the best combination of thicknesses, the fitting of multilayer thicknesses, and a basic derivation of the refractive index and extinction coefficient. The match rate reaches 95% when comparing simulation results with experimental data. These are the functions the program provides.
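
For context, such simulations typically implement the standard characteristic-matrix (transfer-matrix) formulation of thin-film optics; whether the thesis's program uses exactly this form is an assumption. For a stack of q layers at wavelength \lambda:

\[
M_j = \begin{pmatrix} \cos\delta_j & \dfrac{i\,\sin\delta_j}{\eta_j} \\ i\,\eta_j\sin\delta_j & \cos\delta_j \end{pmatrix},
\qquad
\delta_j = \frac{2\pi\, n_j d_j \cos\theta_j}{\lambda},
\]
\[
\begin{pmatrix} B \\ C \end{pmatrix} = \Bigl[\prod_{j=1}^{q} M_j\Bigr]\begin{pmatrix} 1 \\ \eta_{\mathrm{sub}} \end{pmatrix},
\qquad
r = \frac{\eta_0 B - C}{\eta_0 B + C},
\qquad
R = |r|^2 .
\]

Here n_j, d_j and \eta_j are the refractive index, physical thickness and optical admittance of layer j, while \eta_0 and \eta_sub are those of the incident medium and substrate; for an absorbing stack the absorptance follows from A = 1 - R - T.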
APA, Harvard, Vancouver, ISO, and other styles
48

Chen, Wei-Sheng, and 陳威勝. "Hybrid Simulation of Optical Properties for Thin Film Using CUDA on NVIDIA GPUs." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/99410635297475769146.

Full text
Abstract:
Master's thesis
National Kaohsiung First University of Science and Technology
Graduate Institute of Electro-Optical Engineering
101
This study covers optical simulation and multifunction program design with CUDA. It includes: 1. optimization of the thickness of a solar selective absorbing film, 2. reflectivity fitting at the optimal thickness, 3. reflectivity simulation of double-sided coatings, 4. optimal film thickness fitting for superlattices, and 5. calculation of the reflectivity and absorption rate of multilayer films. The study then measures whether the GPU can increase efficiency while maintaining confidence in the accuracy. Compared with an Intel i7 series CPU, an NVIDIA G100 series GPU increases efficiency by more than 40 times, and the relative difference is less than 5%. In other words, the GPU can replace the CPU for very large calculations. Compared with a simulation program developed on the Matlab 2008 platform, the efficiency increases by up to 200 times.
APA, Harvard, Vancouver, ISO, and other styles
49

Cieslakiewicz, Dariusz. "Unsupervised asset cluster analysis implemented with parallel genetic algorithms on the NVIDIA CUDA platform." Thesis, 2014.

Find full text
Abstract:
During times of stock market turbulence and crises, monitoring the clustering behaviour of financial instruments allows one to better understand the behaviour of the stock market and the associated systemic risks. In the study undertaken, I apply an effective and performant approach to classify data clusters in order to better understand correlations between stocks. The novel methods aim to address the lack of effective algorithms for high-performance cluster analysis in the context of large, complex, real-time, low-latency data sets. I apply an efficient and novel data clustering approach, namely the Giada and Marsili log-likelihood function derived from the Noh model, and use a Parallel Genetic Algorithm in order to isolate residual data clusters. Genetic Algorithms (GAs) are a very versatile methodology for scientific computing, while the application of Parallel Genetic Algorithms (PGAs) further increases the computational efficiency. They are an effective vehicle to mine data sets for information and traits. However, the traditional parallel computing environment can be expensive. I focused on adopting NVIDIA's Compute Unified Device Architecture (CUDA) programming model in order to develop a PGA framework for my computation solution, where I aim to efficiently filter out residual clusters. The results show that the application of the PGA with the novel clustering function on the CUDA platform is quite effective in improving the computational efficiency of parallel data cluster analysis.
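
For reference, the Giada-Marsili log-likelihood that such a genetic algorithm maximizes is usually written as Lc = 1/2 * sum over clusters s with n_s > 1 of [ ln(n_s/c_s) + (n_s - 1) ln((n_s^2 - n_s)/(n_s^2 - c_s)) ], where n_s is the cluster size and c_s the sum of pairwise correlations inside cluster s; this form is recalled from the literature, so treat the exact expression as an assumption here. A direct host-side fitness evaluation is only a few lines:

// Sketch of a GA fitness based on the Giada-Marsili log-likelihood (form recalled
// from the literature; the exact expression used in the thesis is an assumption).
#include <cmath>
#include <vector>

double likelihood(const std::vector<int> &cluster_of,          // cluster label per asset
                  const std::vector<std::vector<double>> &C,   // correlation matrix
                  int n_clusters)
{
    int n = (int)cluster_of.size();
    std::vector<double> ns(n_clusters, 0.0), cs(n_clusters, 0.0);
    for (int i = 0; i < n; ++i) {
        ns[cluster_of[i]] += 1.0;
        for (int j = 0; j < n; ++j)
            if (cluster_of[i] == cluster_of[j]) cs[cluster_of[i]] += C[i][j];
    }
    double L = 0.0;
    for (int s = 0; s < n_clusters; ++s) {
        if (ns[s] <= 1.0) continue;                      // singleton clusters contribute nothing
        L += std::log(ns[s] / cs[s])
           + (ns[s] - 1.0) * std::log((ns[s] * ns[s] - ns[s]) / (ns[s] * ns[s] - cs[s]));
    }
    return 0.5 * L;                                      // the fitness the GA tries to maximize
}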
APA, Harvard, Vancouver, ISO, and other styles
50

"Analysis of Hardware Usage Of Shuffle Instruction Based Performance Optimization in the Blinds-II Image Quality Assessment Algorithm." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.45553.

Full text
Abstract:
With the advent of GPGPU, many applications are being accelerated using the CUDA programming paradigm. We are able to achieve around 10x-100x speedups by simply porting an application to the GPU and running the parallel chunk of code on its many-core SIMT (single-instruction, multiple-thread) architecture. But for optimal performance, it is necessary to make sure that all the GPU resources are used efficiently and that the latencies in the application are minimized. For this, it is essential to monitor the hardware usage of the algorithm and thus diagnose the compute and memory bottlenecks in the implementation. In this thesis, we analyze the mapping of the CUDA implementation of the BLIINDS-II algorithm onto the underlying GPU hardware, and we come up with a Kepler-architecture-specific solution that uses the shuffle instruction via the CUB library to tackle the two major bottlenecks in the algorithm. Experiments were conducted to show the advantage of using the shuffle instruction over using shared memory only as a buffer to global memory. With the new implementation of the BLIINDS-II algorithm using the CUB library, a speedup of around 13.7% was achieved.
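
To illustrate the optimization being analyzed (a generic warp-shuffle reduction of the kind CUB builds on, not the BLIINDS-II code itself), register-to-register shuffles let a warp reduce a value without a round trip through shared memory:

// Generic warp-level sum reduction using shuffle instructions.
// This is the building block that replaces a shared-memory buffer in such optimizations.
__inline__ __device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // add the lane 'offset' positions away
    return val;                                            // lane 0 holds the warp's total
}

__global__ void block_sum(const float *in, float *out, int n)
{
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        v += in[i];                       // grid-stride accumulation into a register
    v = warp_reduce_sum(v);
    if ((threadIdx.x & 31) == 0)          // one atomic per warp instead of one per thread
        atomicAdd(out, v);
}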
Dissertation/Thesis
Masters Thesis Engineering 2017
APA, Harvard, Vancouver, ISO, and other styles