Dissertations / Theses on the topic 'Nvidia CUDA'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Nvidia CUDA.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Zajíc, Jiří. "Překladač jazyka C# do jazyka Nvidia CUDA." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236439.
Full textSavioli, Nicolo'. "Parallelization of the algorithm WHAM with NVIDIA CUDA." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/6377/.
Full textIkeda, Patricia Akemi. "Um estudo do uso eficiente de programas em placas gráficas." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-25042012-212956/.
Full textInitially designed for graphical processing, the graphic cards (GPUs) evolved to a high performance general purpose parallel coprocessor. Due to huge potencial that graphic cards offer to several research and commercial areas, NVIDIA was the pioneer lauching of CUDA architecture (compatible with their several cards), an environment that take advantage of computacional power combined with an easier programming. In an attempt to make use of all capacity of GPU, some practices must be followed. One of them is to maximizes hardware utilization. This work proposes a practical and extensible tool that helps the programmer to choose the best configuration and achieve this goal.
Rivera-Polanco, Diego Alejandro. "COLLECTIVE COMMUNICATION AND BARRIER SYNCHRONIZATION ON NVIDIA CUDA GPU." Lexington, Ky. : [University of Kentucky Libraries], 2009. http://hdl.handle.net/10225/1158.
Full textTitle from document title page (viewed on May 18, 2010). Document formatted into pages; contains: ix, 88 p. : ill. Includes abstract and vita. Includes bibliographical references (p. 86-87).
Harvey, Jesse Patrick. "GPU acceleration of object classification algorithms using NVIDIA CUDA /." Online version of thesis, 2009. http://hdl.handle.net/1850/10894.
Full textLerchundi, Osa Gorka. "Fast Implementation of Two Hash Algorithms on nVidia CUDA GPU." Thesis, Norwegian University of Science and Technology, Department of Telematics, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9817.
Full textUser needs increases as time passes. We started with computers like the size of a room where the perforated plaques did the same function as the current machine code object does and at present we are at a point where the number of processors within our graphic device unit its not enough for our requirements. A change in the evolution of computing is looming. We are in a transition where the sequential computation is losing ground on the benefit of the distributed. And not because of the birth of the new GPUs easily accessible this trend is novel but long before it was used for projects like SETI@Home, fightAIDS@Home, ClimatePrediction and there were shouting from the rooftops about what was to come. Grid computing was its formal name. Until now it was linked only to distributed systems over the network, but as this technology evolves it will take different meaning. nVidia with CUDA has been one of the first companies to make this kind of software package noteworthy. Instead of being a proof of concept its a real tool. Where the transition is expressed in greater magnitude in which the true artist is the programmer who uses it and achieves performance increases. As with many innovations, a community distributed worldwide has grown behind this software package and each one doing its bit. It is noteworthy that after CUDA release a lot of software developments grown like the cracking of the hitherto insurmountable WPA. With Sony-Toshiba-IBM (STI) alliance it could be said the same thing, it has a great community and great software (IBM is the company in charge of maintenance). Unlike nVidia is not as accessible as it is but IBM is powerful enough to enter home made supercomputing market. In this case, after IBM released the PS3 SDK, a notorious application was created using the benefits of parallel computing named Folding@Home. Its purpose is to, inter alia, find the cure for cancer. To sum up, this is only the beginning, and in this thesis is sized up the possibility of using this technology for accelerating cryptographic hash algorithms. BLUE MIDNIGHT WISH (The hash algorithm that is applied to the surgery) is undergone to an environment change adapting it to a parallel capable code for creating empirical measures that compare to the current sequential implementations. It will answer questions that nowadays havent been answered yet. BLUE MIDNIGHT WISH is a candidate hash function for the next NIST standard SHA-3, designed by professor Danilo Gligoroski from NTNU and Vlastimil Klima an independent cryptographer from Czech Republic. So far, from speed point of view BLUE MIDNIGHT WISH is on the top of the charts (generally on the second place right behind EDON-R - another hash function from professor Danilo Gligoroski). One part of the work on this thesis was to investigate is it possible to achieve faster speeds in processing of Blue Midnight Wish when the computations are distributed among the cores in a CUDA device card. My numerous experiments give a clear answer: NO. Although the answer is negative, it still has a significant scientific value. The point is that my work acknowledges viewpoints and standings of a part of the cryptographic community that is doubtful that the cryptographic primitives will benefit when executed in parallel in many cores in one CPU. Indeed, my experiments show that the communication costs between cores in CUDA outweigh by big margin the computational costs done inside one core (processor) unit.
Virk, Bikram. "Implementing method of moments on a GPGPU using Nvidia CUDA." Thesis, Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/33980.
Full textSreenibha, Reddy Byreddy. "Performance Metrics Analysis of GamingAnywhere with GPU accelerated Nvidia CUDA." Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16846.
Full textBourque, Donald. "CUDA-Accelerated ORB-SLAM for UAVs." Digital WPI, 2017. https://digitalcommons.wpi.edu/etd-theses/882.
Full textSubramoniapillai, Ajeetha Saktheesh. "Architectural Analysis and Performance Characterization of NVIDIA GPUs using Microbenchmarking." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1344623484.
Full textNejadfard, Kian. "Context-aware automated refactoring for unified memory allocation in NVIDIA CUDA programs." Cleveland State University / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=csu1624622944458295.
Full textZaahid, Mohammed. "Performance Metrics Analysis of GamingAnywhere with GPU acceletayed NVIDIA CUDA using gVirtuS." Thesis, Blekinge Tekniska Högskola, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16852.
Full textKrivoklatský, Filip. "Návrh vestavaného systému inteligentného vidění na platformě NVIDIA." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-400627.
Full textLoundagin, Justin. "Optimizing Harris Corner Detection on GPGPUs Using CUDA." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1348.
Full textShaker, Alfred M. "COMPARISON OF THE PERFORMANCE OF NVIDIA ACCELERATORS WITH SIMD AND ASSOCIATIVE PROCESSORS ON REAL-TIME APPLICATIONS." Kent State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=kent1501084051233453.
Full textAraújo, João Manuel da Silva. "Paralelização de algoritmos de Filtragem baseados em XPATH/XML com recurso a GPUs." Master's thesis, FCT - UNL, 2009. http://hdl.handle.net/10362/2530.
Full textEsta dissertação envolve o estudo da viabilidade da utilização dos GPUs para o processamento paralelo aplicado aos algoritmos de filtragem de notificações num sistema editor/assinante. Este objectivo passou por realizar uma comparação de resultados experimentais entre a versão sequencial (nos CPUs) e a versão paralela de um algoritmo de filtragem escolhido como referência. Essa análise procurou dar elementos para aferir se eventuais ganhos da exploração dos GPUs serão suficientes para compensar a maior complexidade do processo.
Shi, Bobo. "Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1462793739.
Full textČermák, Michal. "Detekce pohyblivého objektu ve videu na CUDA." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-236992.
Full textFuksa, Tomáš. "Paralelizace výpočtů pro zpracování obrazu." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2011. http://www.nusl.cz/ntk/nusl-219371.
Full textMašek, Jan. "Dynamický částicový systém jako účinný nástroj pro statistické vzorkování." Doctoral thesis, Vysoké učení technické v Brně. Fakulta stavební, 2018. http://www.nusl.cz/ntk/nusl-390276.
Full textSenthil, Kumar Nithin. "Designing optimized MPI+NCCL hybrid collective communication routines for dense many-GPU clusters." The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619132252608831.
Full textBartosch, Nadine. "Correspondence-based pairwise depth estimation with parallel acceleration." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-34372.
Full textKarri, Venkata Praveen. "Effective and Accelerated Informative Frame Filtering in Colonoscopy Videos Using Graphic Processing Units." Thesis, University of North Texas, 2010. https://digital.library.unt.edu/ark:/67531/metadc31536/.
Full textEkstam, Ljusegren Hannes, and Hannes Jonsson. "Parallelizing Digital Signal Processing for GPU." Thesis, Linköpings universitet, Programvara och system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167189.
Full textChehaimi, Omar. "Parallelizzazione dell'algoritmo di ricostruzione di Feldkamp-Davis-Kress per architetture Low-Power di tipo System-On-Chip." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/13918/.
Full textMintěl, Tomáš. "Interpolace obrazových bodů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-236736.
Full textNěmeček, Petr. "Geometrické transformace obrazu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-236764.
Full textHordemann, Glen J. "Exploring High Performance SQL Databases with Graphics Processing Units." Bowling Green State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1380125703.
Full textMusic, Sani. "Grafikkort till parallella beräkningar." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20150.
Full textThis study describes how we can use graphics cards for general purpose computingwhich differs from the most usual field where graphics cards are used, multimedia.The study describes and discusses present day alternatives for usinggraphic cards for general operations. In this study we use and describe NvidiaCUDA architecture. The study describes how we can use graphic cards for generaloperations from the point of view that we have programming knowledgein some high-level programming language and knowledge of how a computerworks. We use accelerated libraries (THRUST and CUBLAS) to achieve our goalson the graphics card, which are software development and benchmarking. Theresults are programs countering certain problems (matrix multiplication, sorting,binary search, vector inverting) and the execution time and speedup forthese programs. The graphics card is compared to the processor in serial andthe processor in parallel. Results show a speedup of up to approximatly 50 timescompared to serial implementations on the processor.
Farabegoli, Nicolas. "Implementazione ottimizata dell'operatore di Dirac su GPGPU." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20356/.
Full textMacenauer, Pavel. "Detekce objektů na GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2015. http://www.nusl.cz/ntk/nusl-234942.
Full textStraňák, Marek. "Raytracing na GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-237020.
Full textArtico, Fausto. "Performance Optimization Of GPU ELF-Codes." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3424532.
Full textGPUs (Graphic Processing Units) sono di interesse per il loro favorevole rapporto $\frac{GF/s}{price}$. Rispetto all'inizio - primi anni 70 - oggigiorno le architectture GPU sono più simili ad architectture general purpose ma hanno un numero (molto) più grande di cores - la architecttura GF100 rilasciata da NVIDIA durante il 2009-2010, per esempio, ha una vera gerarchia di memoria cache, uno spazio unificato per l'indirizzamento in memoria, è in grado di eseguire calcoli in doppia precisione ed ha un massimo 512 core. Sfruttare la potenza computazionale delle GPU per applicazioni non grafiche - passate o presenti - è, comunque, sempre stato difficile. Inizialmente, nei primi anni 2000, la programmazione su GPU avveniva (esclusivamente) attraverso l'uso librerie grafiche, le quali rendevano la scrittura di codici non grafici non triviale e tediosa al meglio, e virtualmente impossibile al peggio. Nel 2003, furono introdotti il compilatore e il sistema runtime Brook che diedero agli utenti l'abilità di generare codice GPU da un linguaggio di programmazione ad alto livello. Nel 2006 NVIDIA introdusse CUDA (Compute Unified Device Architecture). CUDA, un modello di programmazione e computazione parallela specificamente sviluppato da NVIDIA per le sue GPUs, tenta di facilitare ulteriormente la programmazione general purpose di GPU. Codice scritto in CUDA è portabile tra differenti architectture GPU della NVIDIA e questa è una delle ragioni perché NVIDIA afferma che la produttività degli utenti è molto più alta di precedenti soluzioni, tuttavia ottimizare codice GPU con l'obbiettivo di ottenere le massime prestazioni rimane molto difficile, specialmente per NVIDIA GPUs che usano l'architecttura GF100 - per esempio, Fermi GPUs e delle Tesla GPUs - perché a) il vero instruction set architecture (ISA) è non pubblicamente disponibile, b) il codice del compilatore NVIDIA - nvcc - è non aperto e c) gli utenti non possono scrivere codice usando il vero assembly - ELF nel gergo della NVIDIA. I compilatori, mentre permettono un immenso incremento della produttività di un programmatore, eliminando la necessità di codificare al (tedioso) livello assembly, sono incapaci di ottenere, a questa data, prestazioni simili a quelle di un programmatore che è esperto in assembly ed ha una buona conoscenza dell'architettura sottostante. Infatti, è largamente accettato che programmazione ad alto livello e compilazione perfino con compilatori che sono considerati allo stato dell'arte perdono, in media, un fattore 3 in prestazione - e a volte molto di più - nei confronti di cosa un buon programmatore assembly potrebbe ottenere, e questo perfino su una macchina convenzionale, semplice, a singolo core. Compilatori per macchine più complesse, come le GPU NVIDIA, sono propensi a fare molto peggio perché tra le altre cose, essi devono determinare (persino più) complessi trade-offs durante la ricerca di soluzioni a problemi spesso indecidibili e NP-hard. Peraltro, perché NVIDIA a) rende virtualmente impossibile guadagnare accesso all'attuale linguaggio assembly usato dalla architettura GF100, b) non spiega pubblicamente molti dei meccanismi interni implementati nel suo compilatore - nvcc - e c) rende virtualmente impossible imparare i dettagli della molto complessa architecttura GF100 ad un sufficiente livello di dettaglio che permetta di sfruttarli, ottenere una stima delle differenze prestazionali tra programmazione in CUDA e programmazione a livello macchina per GPU NVIDIA che usano la architecttura GF100 - per non parlare dell'ottenimento a priori di garanzie di tempo di esecuzione più breve - è stato, prima di questo corrente lavoro, impossbile. Per ottimizare codice GPU, gli utenti devono usare CUDA or PTX (Parallel Thread Execution) - un instruction set architecture virtuale. I file CUDA or PTX sono dati in input a nvcc che produce come output fatbin file. I fatbin file sono prodotti considerando l'architecttura GPU selezionata dall'utente - questo è fatto settando un flag usato da nvcc. In un fatbin file, zero o più parti del fatbin file saranno eseguite dalla CPU - pensa a queste parti come le parti C/C++ - mentre le rimanenti parti del fatbin file - pensa a queste parti come le parti ELF - saranno eseguite dallo specifico modello GPU per il quale i file CUDA or PTX sono stati compilati. I fatbin file sono normalmente molto differenti dai corrispodenti file CUDA o PTX e questa assenza di controllo può completamente rovinare qualsiasi sforzo fatto a livello CUDA o PTX per otimizzare la parte o le parti ELF del fatbin file che sarà eseguita / saranno eseguite dalla GPU per la quale il fatbin file è stato compilato. Noi quindi scopriamo quale è il vero ISA usato dalla architettura GF100 e generiamo un insieme di linea guida per scrivere codice in modo tale da forzare nvcc a generare fatbin file con almeno il minimo numero di risorse successivamente necessario per modificare i fatbin file per ottenere le volute implementazioni algoritmiche in ELF - questo da controllo sul codice ELF che è eseguito da qualsiasi GPU che usa l'architettura GF100. Durante il processo di scoperata del vero ISA scopriamo anche le corrispondenze tra istruzioni PTX e istruzioni ELF - una singola istructione PTX può essere transformata in one o più istruzioni ELF - e le corrispondenze tra registri PTX e registri ELF. La nostra procedura è completamente ripetibile per ogni NVIDIA Kepler GPU - non occorre che riscrivamo il nostro codice. Essere in grado di ottenere le volute implementazioni algoritmiche in ELF non è abbastanza per ottimizzare il codice ELF di un fatbin file, ci occorre infatti anche scoprire, comprendere e quantificare dei comportamenti GPU che non sono divulgati e che potrebbero rallentare l'esecuzione di codice ELF. Questo è necessario per comprendere come eseguire il processo di ottimizzazione e mentre noi non possiamo riportare qui tutti i risultati che abbiamo ottenuto, noi possiamo comunque dire che spiegheremo al lettore a) come forzare una distribuzione uniforme dei GPU thread blocks agli streaming multiprocessors, b) come abbiamo scoperto e quantificato diversi fenomeni riguardanti il warp scheduling, c) come evitare fenomeni di warp scheduling load unblanacing, che è non possible controllare, negli streaming multiprocessors, d) come abbiamo determinato, per ogni istruzione ELF, la minima quantità di tempo che è necessario attendere prima che un warp scheduler possa schedulare ancora un warp - si, la quantità di tempo può essere differente per differenti istruzioni ELF - e) come abbiamo determinato il tempo che è necessario attendere prima di essere in grado di leggere ancora un dato in un registro precedentemente letto o scritto - questo pure può essere differente per differnti istruzioni ELF e differente se il dato è stato precedentemente letto o scritto - e f) come abbiamo scoperto la presenza di un tempo di overhead per la gestione dei warp che non cresce linearmente ad un incremento lineare del numero di warp residenti in uno streaming multiprocessor. Successivamente, noi spiegamo a) le procedure di trasformazione che è necessario applicare al codice ELF di un fatbin file per ottimizzare il codice ELF e così rendere il suo tempo di esecuzione il più corto possibile, b) perché occorre classificare i fatbin file generati dal fatbin file originale durante il processo di ottimizzazione e come noi facciamo questo usando diversi criteri che come risultato finale permettono a noi di determinare le posizioni, occupate da ogni fatbin file generato, in una tassonomia che noi abbiamo creato, c) come usando la posizione di un fatbin file nella tassonomia noi determiniamo se il fatbin file è qualificato per una analisi empirica - che noi spieghiamo - una analisi teorica o entrambe and d) come - supponendo il fatbin file sia qualificato per una analisi teorica - noi eseguiamo l'analisi teorica che abbiamo ideato e diamo a priori - senza alcuna precedente esecuzione del fatbin file - la garanzia - questo supponendo il fatbin file soddisfi tutti i requisiti dell'analisi teorica - che l'esecuzione del codice ELF del fatbin file, quando il fatbin file sarà eseguito sulla architettura GPU per cui è stato generato, sarà la più breve possibile.
Adeboye, Taiyelolu. "Robot Goalkeeper : A robotic goalkeeper based on machine vision and motor control." Thesis, Högskolan i Gävle, Avdelningen för elektronik, matematik och naturvetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-27561.
Full textRek, Václav. "Využití paralelizace při numerickém řešení úloh nelineární dynamiky." Doctoral thesis, Vysoké učení technické v Brně. Fakulta stavební, 2018. http://www.nusl.cz/ntk/nusl-392279.
Full textCazalas, Jonathan M. "Efficient and Scalable Evaluation of Continuous, Spatio-temporal Queries in Mobile Computing Environments." Doctoral diss., University of Central Florida, 2012. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5154.
Full textID: 031001567; System requirements: World Wide Web browser and PDF reader.; Mode of access: World Wide Web.; Title from PDF title page (viewed August 26, 2013).; Thesis (Ph.D.)--University of Central Florida, 2012.; Includes bibliographical references (p. 103-112).
Ph.D.
Doctorate
Computer Science
Engineering and Computer Science
Computer Science
Pecháček, Václav. "Akcelerace heuristických metod diskrétní optimalizace na GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236550.
Full textMaurer, Andreas. "Methods for Multisensory Detection of Light Phenomena on the Moon as a Payload Concept for a Nanosatellite Mission." Thesis, Luleå tekniska universitet, Rymdteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-80785.
Full textPospíchal, Petr. "Akcelerace genetického algoritmu s využitím GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-236783.
Full textVenkatasubramanian, Sundaresan. "Tuned and asynchronous stencil kernels for CPU/GPU systems." Thesis, Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/29728.
Full textCommittee Chair: Vuduc, Richard; Committee Member: Kim, Hyesoon; Committee Member: Vetter, Jeffrey. Part of the SMARTech Electronic Thesis and Dissertation Collection.
Lu, Chih-Te, and 盧志德. "Multiview Encoder Parallelized Fast Search Realization on NVIDIA CUDA." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/vdd5k4.
Full text國立臺北科技大學
資訊工程系研究所
98
Due to the rapid growth of the graphics processing unit (GPU) processing capability, it gets more and more popular to use it for non-graphics computations. NVIDIA announced a powerful GPU architecture called Compute Unified Device Architecture (CUDA) in 2007, which is able to provide massive data parallelism under the SIMD architecture constraint. We use NVIDIA GTX-280 GPU system, which has 240 computing cores, as the platform to implement a very complicated video coding scheme. The Multiview Video Coding (MVC) scheme, an extension of H.264/AVC/MPEG-4 Part 10 (AVC), is being developed by the international standard team joined by the ITU-T Video Coding Experts Group and the ISO/IEC JTC 1 Moving Pictures Experts Group (MPEG). It is an efficient video compression scheme; however, its computational compexity is very high. Two of its most time-consuming components are motion estimation (ME) and disparity estimation (DE). In this thesis, we propose a fast search algorithm, called multithreaded one-dimensional search (MODS). It can be used to do both the ME and the DE operations. We implement the integer-pel ME and DE processes with MODS on the GTX-280 platform. The speedup ratio can be 89 times faster than the CPU only configuration. Even when the fast search algorithm of the original JMVC is turned on, the MODS version on CUDA can still be 21 times faster.
Chen, Wei-Nien, and 陳威年. "H.264/AVC Encoder Parallelized Realization on NVIDIA CUDA." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/44801582751450067743.
Full text國立交通大學
電子工程系所
96
Due to the rapid growth of graphics processing unit (GPU) processing capability, using GPU as a coprocessor to assist the central processing unit (CPU) in computing massive data becomes essential. NVIDIA announced a powerful GPU architecture called Compute Unified Device Architecture (CUDA) in 2007. This new architecture largely improves the programming flexibility of general-purpose GPU. In this thesis, we propose a highly parallel intra mode selection scheme and a full search motion estimation scheme with fractional pixel refinement optimized for the CUDA architecture. In order to achieve the block-level parallelized intra mode selection, the original pixel values rather than the coded pixels are used for deciding the best intra-prediction mode. In addition, to fully utilize the computation power of CUDA, the thread usage and memory access pattern are carefully tuned. Following the parallel processing optimization rules, we design a motion estimation algorithm consisting of 5 stages. We try to process as many data as possible to fully use the computing power of this GPU. The proposed algorithms are evaluated on the NVIDIA GeForce 8800GTX GPU platform. The speed up ratios of these two modules are about 12 times faster, and the overall H.264/AVC encoding time is about 5 times faster than the PC only counterpart.
Lai, Chen-Yen, and 賴辰彥. "H.264/AVC-SVC Encoder Parallelized Realization on NVIDIA CUDA." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/56388772998360120538.
Full textTsai, Sung-Han, and 蔡松翰. "Optimization for sparse matrix-vector multiplication based on NVIDIA CUDA platform." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/qw23p7.
Full text國立彰化師範大學
資訊工程學系
105
In recent years, large size sparse matrices are often used in fields such as science and engineering which usually apply in computing linear model. Using the ELLPACK format to store sparse matrices, it can reduce the matrix storage space. But if there is too much nonzero elements in one of row of the original sparse matrix, it still waste too much memory space. There are many research focusing on the Sparse Matrix–Vector Multiplication(SpMV)with ELLPACK format on Graphic Processing Unit(GPU). Therefore, the purpose of our research is reducing the access space of sparse matrix which is transformed in Compressed Sparse Row(CSR)format after Reverse Cutthill-McKee(RCM)algorithm to accelerate for SpMV on GPU. Due to lower data access ratio from SpMV, the performance is restricted by memory bandwidth. Our propose is based on CSR format from two aspects:(1)reduce cache misses to enhance the vector locality and raise the performance, and(2)reduce accessed matrix data by index reduction to optimize the performance.
Wang, Chun Hung, and 王鈞弘. "A new method for accelerating compound comparison based on NVIDIA CUDA." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/79j237.
Full text長庚大學
資訊工程學系
104
Computer-Aided Drug Design (CADD) becomes an emerging research field, and it is helpful to improve the efficiencies of drug design and development. According to the theoretical computation, CADD predicts the properties of physical chemistry, and visualize the 3D-structure of protein, ligand, or their combinations with the functions of computer graphics. In addition, CADD can compute the relation of energy variations between them, or study the pharmacophore of ligand, in order to design new drugs. However, it needs about several weeks or months to do the computer-aided drug design for a taget protein due to the complex computations and then limits the developments. In recent years, the computation capability of graphics processing units (GPU) has been improved greatly by the industry. More and more research fields or applications have tried to use GPUs to speed the computations or extend the experimental scales. It will be a very important issue to apply GPUs to the computer-aided drug design. Compound comparison is an important task for the computational chemistry. By the comparison results, potential inhibitors can be found and then used for the pharmacy experiments. The time complexity of a pairwise compound comparison is O(n2), where n is the maximal length of compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have been synthesized and extracted now, even more than ten of millions. Therefore, it still will be time-consuming when comparing with a large amount of compounds (seen as a multiple compound comparison problem, abbreviated to MCC). The intrinsic time complexity of MCC problem is O(k2n2) with k compounds of length n. In this paper, we propose a GPU-based algorithm for MCC problem, called CUDA-MCC, on multi-GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation speed among thread blocks on GPUs. CUDA-MCC was implemented by C+ OpenMP +CUDA, and result is faster than its CPU version on NVIDIA Tesla K20c card and a NVIDIA Tesla K40c card and NVIDIA Tegra K1, respectively, under the experimental results.
Yang, Fu-Kai, and 楊復凱. "Acceleration and Improvement of MPEG View Synthesis Reference Software on NVIDIA CUDA." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/62819489159539479165.
Full text國立交通大學
電子研究所
100
With the prosperity of 3D technology, Free Viewpoint Television (FTV) becomes a popular research topic. “View Synthesis” is a key step in FTV. There are some important and to-be-solved issues such as real-time operation and complexity reduction. NVIDIA Compute Unified Device Architecture (CUDA) is an effective platform in handling data-intensive applications. To implement the MPEG view synthesis reference software (VSRS) on CUDA, we parallelize the VSRS structure. In the meanwhile, our proposed parallel scheme improves the picture quality. We first propose an intra hole filling scheme to replace the original median filter. Then, to avoid data dependence we properly partition the data so that they can be processed by the parallel GPU threads. Also, we rearrange the data processing order in the threads to reduce branching instructions. Combining these techniques together, we save more than 94% computing time and achieve a similar image quality.
Chi, Ping-Lin, and 机炳霖. "Simulation of Optical Properties for Thin Film Using CUDA on NVIDIA GPUs." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/23775928600592540874.
Full text國立高雄第一科技大學
光電工程研究所
99
Firstly, the thesis will discuss the difference of parallel computing between the ways of data permutation in multi-threads and single thread, then measure whether the performance of GPU can increase the efficiency and the confidence in the accuracy. Compared with Intel i7 series CPU, the efficiency of NVIDIA G100 series GPU increases more than 40 times, and the effect for relative difference is less than that of 10E-15. That is to say, GPU can be a replacement of CPU to conduct huge calculation. Compared with the simulation programming of development platform with Matlab 2008 , the efficiency increases up to 200 times. In this programming, we have chosen two ways to optimize, including the useful cache memory and PCI-E bandwidth. Besides, it is also be mentioned about the Calculating method for improving the simulation programming and the solution for Many-core processor and multi-GPU. As for the functions, it provides the calculation for the transmittance and reflection of multi-selectivity absorb film, the absorbance of sunlight, the collocation of the best thickness, the supposition of the multi-thickness, and primitive derivation of refractive index and extinction coefficient. The match rate is up to 95% according to comparison of simulation result with experiment date. These are the functions that will be used.
Chen, Wei-Sheng, and 陳威勝. "Hybrid Simulation of Optical Properties for Thin Film Using CUDA on NVIDIA GPUs." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/99410635297475769146.
Full text國立高雄第一科技大學
光電工程研究所
101
This study is optical simulation and multifunction program design by CUDA, it contains: 1.Optimization in the thickness of solar selective absorbing film, 2.Reflectivity of optimal thickness fitting, 3.The reflectivity simulation of double-sided coating, 4.The optimal film thickness fitting on superlattice, 5.The reflectivity of multilayer films and the calculation of the absorption rate. Then measure whether the performance of GPU can increase the efficiency and the confidence in the accuracy. Compared with Intel i7 series CPU, the efficiency of NVIDIA G100 series GPU increases more than 40 times, and the effect for relative difference is less than that of 5%. That is to say, GPU can be a replacement of CPU to conduct huge calculation. Compared with the simulation programming of development platform with Matlab 2008 , the efficiency increases up to 200 times.
Cieslakiewicz, Dariusz. "Unsupervised asset cluster analysis implemented with parallel genetic algorithms on the NVIDIA CUDA platform." Thesis, 2014.
Find full text"Analysis of Hardware Usage Of Shuffle Instruction Based Performance Optimization in the Blinds-II Image Quality Assessment Algorithm." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.45553.
Full textDissertation/Thesis
Masters Thesis Engineering 2017