Dissertations / Theses: 'LU factorization'

1

Syed, Akber. "A Hardware Interpreter for Sparse Matrix LU Factorization." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1024934521.

2

Schenk, Olaf. "Scalable parallel sparse LU factorization methods on shared memory multiprocessors." Zürich, 2000. http://e-collection.ethbib.ethz.ch/show?type=diss&nr=13515.

3

Thiyagarajan, Sanjeev. "Reducing Memory Space for Completely Unrolled LU Factorization of Sparse Matrices." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin990556295.

4

Somers, Gregory W. "Acceleration of Block-Aware Matrix Factorization on Heterogeneous Platforms." Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/35128.

Abstract:

Block-structured matrices arise in several contexts in circuit simulation problems. These matrices typically inherit the pattern of sparsity from the circuit connectivity. However, they are also characterized by dense spots or blocks. Direct factorization of those matrices has emerged as an attractive approach if the host memory is sufficiently large to store the block-structured matrix. The approach proposed in this thesis aims to accelerate the direct factorization of general block-structured matrices by leveraging the power of multiple OpenCL accelerators such as Graphical Processing Units (GPUs). The proposed approach utilizes the notion of a Directed Acyclic Graph representing the matrix in order to schedule its factorization on multiple accelerators. This thesis also describes memory management techniques that enable handling large matrices while minimizing the amount of memory transfer over the PCIe bus between the host CPU and the attached devices. The results demonstrate that by using two GPUs the proposed approach can achieve a nearly optimal speedup when compared to a single GPU platform.

5

Netzer, Gilbert. "Efficient LU Factorization for Texas Instruments Keystone Architecture Digital Signal Processors." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-170445.

Abstract:

The energy consumption of large-scale high-performance computing (HPC) systems has become one of the foremost concerns of both data-center operators and computer manufacturers. This has renewed interest in alternative computer architectures that could offer substantially better energy efficiency. Yet the well-optimized implementations of typical HPC benchmarks needed to evaluate the potential of these architectures are often unavailable, because the architectures are new to the HPC industry. The LU factorization benchmark implementation presented in this work aims to provide such a high-quality tool: it implements the HPC industry-standard High-Performance LINPACK benchmark (HPL) for the eight-core Texas Instruments TMS320C6678 digital signal processor (DSP). The implementation performs the LU factorization at up to 30.9 GF/s at a 1.25 GHz core clock frequency using all eight DSP cores of the System-on-Chip (SoC). This is 77% of the attainable peak double-precision floating-point performance of the DSP, a level of efficiency comparable to that expected on traditional x86-based processor architectures. A detailed performance analysis shows that this is largely due to the optimized implementation of the embedded general matrix-matrix multiplication (GEMM). For this operation, the on-chip direct memory access (DMA) engines transfer the necessary data from the external DDR3 memory to the core-private and shared scratchpad memory, which allows data transfers to overlap with computations on the DSP cores. The computations were in turn optimized using software pipelining techniques and were partly implemented in assembly language. With these optimizations, the matrix multiplication reached up to 95% of attainable peak performance. A detailed description of these two key optimization techniques and their application to the LU factorization is included. Using a specially instrumented Advantech TMDXEVM6678L evaluation module, described in detail in related work, the SoC's energy efficiency was measured at up to 2.92 GF/J while executing the presented benchmark. Results from verifying the benchmark execution with the standard HPL correctness checks and an uncertainty analysis of the experimentally gathered data are also presented.
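
The performance figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is only an illustration: the peak rate and the power draw are not stated in the abstract but are derived here from the quoted numbers, and the power estimate assumes that the 30.9 GF/s and 2.92 GF/J figures were measured on the same run, which the abstract does not confirm.

```python
# Figures quoted in the abstract.
achieved_gflops = 30.9   # GF/s reached by the LU benchmark
efficiency = 0.77        # fraction of attainable double-precision peak
clock_ghz = 1.25         # core clock frequency in GHz
cores = 8                # DSP cores on the SoC
energy_eff = 2.92        # GF/J, best measured energy efficiency

# Derived value (not in the abstract): peak implied by rate and efficiency.
implied_peak = achieved_gflops / efficiency

# Derived value: flops per cycle per core implied by that peak.
flops_per_cycle_per_core = implied_peak / (clock_ghz * cores)

# GF/J is (GF/s) / W, so a power draw follows, ASSUMING both figures
# come from the same operating point.
implied_power_w = achieved_gflops / energy_eff

print(f"implied peak: {implied_peak:.1f} GF/s")
print(f"flops per cycle per core: {flops_per_cycle_per_core:.2f}")
print(f"implied power: {implied_power_w:.1f} W")
```

Under these assumptions the implied peak is about 40 GF/s, or roughly four double-precision operations per cycle per core, and the implied power is on the order of 10 W.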

6

Cantane, Daniela Renata. "Contribuição da atualização da decomposição LU no metodo Simplex." [s.n.], 2009. http://repositorio.unicamp.br/jspui/handle/REPOSIP/260212.

Abstract:

Advisors: Aurelio Ribeiro Leite de Oliveira, Christiano Lyra Filho. Doctoral thesis (Doutor em Engenharia Elétrica, Automação), Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação, 2009.

The efficient solution of linear systems is fundamental in linear programming, and the first method to succeed on this class of problems was the Simplex method. To develop efficient alternatives for its implementation, this thesis presents techniques for updating the LU factorization of the basis that improve the solution of the linear systems arising in the Simplex method, using a static reordering of the matrix columns. A simulation of the Simplex method is implemented, performing the basis changes obtained from MINOS and checking their sparsity. Only the factored columns actually modified by the basis change are processed, which yields an efficient LU factorization update. The matrix columns are reordered according to three strategies: minimum degree, block triangular form, and the Björck strategy. Sparse factorizations are thus obtained for any basis with no computational effort spent on ordering the columns, since the reordering of the matrix is static and the basis columns follow this ordering. On the larger problems tested, the block triangular form achieved the best results compared with minimum degree and the Björck strategy. Computational results for Netlib problems show the robustness and good computational performance of the proposed update method, since it does not need the periodic refactorizations of the basis used in traditional updating methods. The proposed method reduced the number of nonzero entries of the basis relative to MINOS. The approach was also applied to cutting stock problems, where the proposed method reduced the computational time needed to solve them relative to GLPK.

7

Herrmann, Julien. "Memory-aware Algorithms and Scheduling Techniques for Matrix Computations." Thesis, Lyon, École normale supérieure, 2015. http://www.theses.fr/2015ENSL1043/document.

Abstract:

In this thesis, we studied, from both a theoretical and a practical point of view, the design of algorithms and scheduling techniques suited to the complex architectures of modern supercomputers. We were particularly interested in the memory usage and communication management of algorithms for high-performance computing (HPC). We exploited the heterogeneity of modern supercomputers to improve the performance of matrix computations. We studied the possibility of intelligently alternating LU factorization steps (faster) with QR factorization steps (numerically more stable but twice as costly) to solve a dense linear system. We improved the performance of dynamic runtime systems with static precomputations that take into account the whole task graph of the Cholesky factorization as well as the heterogeneity of the architecture. We studied the complexity of scheduling task graphs that use large input and output files on a heterogeneous architecture with two types of resources, each using its own memory. We designed many polynomial-time heuristics for general problems that had previously been proven NP-complete. Finally, we designed optimal algorithms for scheduling an automatic-differentiation graph on a platform with two types of memory: one free but limited, the other costly but unlimited.

Throughout this thesis, we have designed memory-aware algorithms and scheduling techniques suited for modern memory architectures. We have shown special interest in improving the performance of matrix computations on multiple levels. At a high level, we have introduced new numerical algorithms for solving linear systems on large distributed platforms. Most of the time, these linear solvers rely on runtime systems to handle resource allocation and data management. We also focused on improving the dynamic schedulers embedded in these runtime systems by adding static information to their decision process. We proposed new memory-aware dynamic heuristics to schedule workflows that could be implemented in such runtime systems. Altogether, we have dealt with multiple state-of-the-art factorization algorithms used to solve linear systems, such as the LU, QR, and Cholesky factorizations. We targeted different platforms ranging from multicore processors to distributed-memory clusters, and worked with several reference runtime systems tailored for these architectures, such as PaRSEC and StarPU. On the theoretical side, we took special care in modelling convoluted hierarchical memory architectures. We have classified the problems that arise when dealing with these storage platforms. We have designed many efficient polynomial-time heuristics for general problems that had been shown NP-complete beforehand.

8

Assis, Carmencita Ferreira Silva. "Sistemas lineares: métodos de eliminação de Gauss e fatoração LU." Universidade Federal de Goiás, 2014. http://repositorio.bc.ufg.br/tede/handle/tede/4490.

Abstract:

Issue date: 2014-03-20. Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

This work presents techniques for solving systems of linear equations in their traditional formulation, drawing on the references commonly used in linear algebra and numerical computation courses and focusing on the direct methods of Gaussian elimination and LU factorization. Problems established in the literature are solved in order to illustrate how these methods work and how they apply to real problems, highlighting the possibility of introducing them in high school. The content is presented so as to exemplify the diversity of areas that involve linear systems, such as engineering, economics, and biology, and to show the gains students can achieve if they encounter these methods as early as possible. Finally, we suggest using computational resources in mathematics classes, since reducing the time spent on algebraic manipulation allows the teacher to deepen the concepts and to address larger systems, which broadens the perspective on solving them and motivates students in the learning process.
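
The two direct methods named in this abstract are closely related: Gaussian elimination with row exchanges produces the factors of PA = LU, after which a system is solved by one forward and one back substitution. The following is a minimal classroom-style sketch in pure Python, not code from the dissertation; the function names `lu_partial_pivot` and `lu_solve` and the 3-by-3 example system are chosen here for illustration only.

```python
def lu_partial_pivot(a):
    """Factor PA = LU; return (perm, lu) with L (unit lower) and U packed together."""
    n = len(a)
    lu = [row[:] for row in a]   # work on a copy
    perm = list(range(n))        # row i of PA is row perm[i] of A
    for k in range(n):
        # Pivot: largest magnitude in column k, on or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                  # multiplier, stored in L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]   # Gaussian elimination step
    return perm, lu


def lu_solve(perm, lu, b):
    """Solve Ax = b given the packed factors of PA = LU."""
    n = len(lu)
    y = [0.0] * n
    for i in range(n):            # forward substitution: L y = P b
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: U x = y
        x[i] = (y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x


a = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
perm, lu = lu_partial_pivot(a)
x = lu_solve(perm, lu, b)   # close to [2.0, 3.0, -1.0]
```

Once the factors are stored, further right-hand sides for the same matrix cost only the two substitutions, which is the usual argument for teaching LU factorization alongside plain elimination.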

9

Pathanjali, Nandini. "Pipelined IEEE-754 Double Precision Floating Point Arithmetic Operators on Virtex FPGA’s." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1017085297.

10

Lee, Eun-Joo. "Accurate and Robust Preconditioning Techniques for Solving General Sparse Linear Systems." UKnowledge, 2008. http://uknowledge.uky.edu/gradschool_diss/650.

11

Donfack, Simplice. "Methods and algorithms for solving linear systems of equations on massively parallel computers." Thesis, Paris 11, 2012. http://www.theses.fr/2012PA112042.

Abstract:

Multicore processors are nowadays considered the future of computing, and they will have an important impact on scientific computing. In this thesis, we study methods and algorithms for efficiently solving large sparse and dense linear systems on future petascale machines, in particular those with a large number of cores. Because communication is becoming increasingly costly relative to the time processors take to perform arithmetic operations, our approach embraces the communication-avoiding principle, performing some redundant computation, and uses several adaptations to achieve better performance on multicore machines. We decompose the problem into several phases that are then designed or optimized separately. In the first part, we present an algorithm based on hypergraph partitioning that considerably reduces the fill-in incurred in the LU factorization of sparse unsymmetric matrices. In the second part, we present two communication-avoiding algorithms for the LU and QR factorizations that are adapted to multicore environments. The main contribution of this part is to reorganize the computations so as to reduce bus contention while using resources efficiently. We then extend this work to clusters of multicore processors. In the third part, we present a new scheduling and optimization approach. Data locality and load balancing form a serious trade-off in the choice of scheduling strategy. On NUMA machines, for example, where data locality is not optional, we observed that in the presence of system noise ("OS noise"), performance could quickly deteriorate and become difficult to predict. To overcome this, we present an approach that combines static and dynamic scheduling for the tasks of our algorithms. Our results on several architectures show that all our algorithms are efficient and lead to significant performance gains. We achieve improvements of 30% to 110% over the corresponding routines in well-known numerical libraries.

12

Maurin, Julien. "Résolution des équations intégrales de surface par une méthode de décomposition de domaine et compression hiérarchique ACA : application à la simulation électromagnétique des larges plateformes." Phd thesis, Toulouse, INPT, 2015. http://oatao.univ-toulouse.fr/15113/1/maurin.pdf.

Abstract:

This study belongs to the field of electromagnetic simulation of large-scale problems, such as the scattering of plane waves by large platforms and the radiation of airborne antennas. It develops a method that combines subdomain decomposition with hierarchical compression of boundary integral equations. To this end, we first recall the main points of the boundary integral equation method and of its hierarchical compression by the ACA (Adaptive Cross Approximation) algorithm. We then present the IE-DDM (Integral Equations – Domain Decomposition Method) formulation, obtained from an integral representation of the subdomains. The matrices resulting from the discretization of this formulation are stored in H-matrix (hierarchical matrix) format. A solver specially adapted to the IE-DDM formulation and to its hierarchical representation was designed. This study demonstrates the effectiveness of subdomain decomposition as a preconditioner for integral equations. In addition, the method developed is fast for problems with multiple incidences and for low-frequency problems.

13

Rémy, Adrien. "Solving dense linear systems on accelerated multicore architectures." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112138/document.

Abstract:

In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems on hybrid architectures with multicore processors and accelerators. We focus on methods based on the LU factorization, and our code development takes place in the context of the MAGMA library. We first study different hybrid CPU/GPU solvers based on the LU factorization that aim to reduce the communication overhead due to pivoting. The first is based on a communication-avoiding pivoting strategy (CALU), while the second uses a random preconditioning of the original system to avoid pivoting (RBT). We show that both methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPU architectures. We then present new solvers based on randomization for hybrid architectures with Nvidia GPUs or Intel Xeon Phi coprocessors. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allows us to randomize the linear system at a very low computational cost compared to the time of the factorization. Finally, we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of the threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can improve performance when applied to hybrid multicore/GPU solvers.

14

Khabou, Amal. "Dense matrix computations : communication cost and numerical stability." Phd thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00833356.

Abstract:

This thesis deals with a linear algebra routine widely used for solving linear systems: the LU factorization. Such a decomposition is usually computed by Gaussian elimination with partial pivoting (GEPP). The numerical stability of GEPP is characterized by a growth factor that remains fairly small in practice. However, the parallel version of this algorithm does not attain the lower bounds that characterize the communication cost of a given algorithm. Indeed, the factorization of a block of columns is a communication bottleneck. To address this problem, Grigori et al. [60] developed an LU factorization that minimizes communication (CALU) at the cost of some redundant computation. In theory, the upper bound on the growth factor of CALU is larger than that of GEPP, yet CALU is stable in practice. To improve the upper bound on the growth factor, we study a new pivoting strategy based on strong rank-revealing QR factorization, and use it to develop a new block LU factorization algorithm. The upper bound on the growth factor of this algorithm is smaller than that of GEPP. This pivoting strategy is then combined with tournament pivoting to produce an LU factorization that minimizes communication and is more stable than CALU. On hierarchical systems, several levels of parallelism are available, but none of the methods above fully exploits these resources. We therefore propose and study two recursive algorithms that rely on the same principles as CALU but are better suited to architectures with several levels of parallelism. To analyze in a precise and realistic way
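
The growth factor mentioned in this abstract can be made concrete with a small experiment. The sketch below is not taken from the thesis: it uses one common definition (largest entry magnitude seen during elimination divided by the largest entry magnitude of the original matrix), and the helper names `growth_factor_gepp` and `wilkinson_matrix` are chosen here for illustration. The test matrix is the standard textbook worst case for partial pivoting, for which the growth factor reaches 2^(n-1).

```python
def growth_factor_gepp(a):
    """Growth factor of Gaussian elimination with partial pivoting on a."""
    n = len(a)
    m = [row[:] for row in a]
    max_initial = max(abs(v) for row in m for v in row)
    max_seen = max_initial
    for k in range(n - 1):
        # Partial pivoting: first row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            continue  # nothing to eliminate in this column
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            m[i][k] = 0.0
            for j in range(k + 1, n):
                m[i][j] -= factor * m[k][j]
                max_seen = max(max_seen, abs(m[i][j]))
    return max_seen / max_initial


def wilkinson_matrix(n):
    """1 on the diagonal and in the last column, -1 below the diagonal."""
    return [[1.0 if (i == j or j == n - 1) else (-1.0 if i > j else 0.0)
             for j in range(n)] for i in range(n)]


print(growth_factor_gepp(wilkinson_matrix(10)))  # 512.0, i.e. 2**9
```

Matrices like this one are rare in practice, which is consistent with the abstract's remark that the growth factor of GEPP stays fairly small on typical problems even though its worst-case bound is exponential.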

15

Estecahandy, Elodie. "Contribution à l'analyse mathématique et à la résolution numérique d'un problème inverse de scattering élasto-acoustique." Phd thesis, Université de Pau et des Pays de l'Adour, 2013. http://tel.archives-ouvertes.fr/tel-00880628.

Abstract:

Determining the shape of an elastic obstacle immersed in a fluid medium from measurements of the scattered wave field is a problem of keen interest in many fields, such as sonar, geophysical exploration, and medical imaging. Because it is nonlinear and ill-posed, this inverse obstacle problem (IOP) is very difficult to solve, particularly from a numerical point of view. Moreover, its study requires understanding the theory of the associated direct scattering problem (DP) and mastering the corresponding solution methods. The work accomplished here concerns the mathematical and numerical analysis of the elasto-acoustic DP and of the IOP. In particular, we developed an efficient numerical simulation code for wave propagation in this type of media, based on a DG-type method that uses higher-order finite elements and curved elements at the interface to better represent the fluid-structure interaction, and we apply it to the reconstruction of objects by implementing a regularized Newton method.

16

Lee, Chun Yi, and 李俊逸. "Levelized Incomplete LU factorization and It's application to semiconductor devices." Thesis, 1998. http://ndltd.ncl.edu.tw/handle/01828800521972743123.

Abstract:

Master's thesis, National Central University, Department of Electrical Engineering, academic year 86. In circuit simulation, CPU time is spent in two parts: one is transforming the circuit equations into the corresponding linear system Ax = b; the other is solving this matrix equation. To improve simulation speed in mixed-level device and circuit simulation, we propose to simplify the creation of the matrix equation by means of equivalent subcircuits, and to speed up the simulation with Levelized Incomplete LU factorization. Simply put, the Levelized Incomplete LU method purges the high-level fill-ins produced by LU decomposition, which saves memory space and computation time. Levelized ILU is used to solve Ax = b because it offers the good convergence of direct methods together with the high speed and small memory footprint of iterative methods. The system Ax = b is obtained by transforming the Poisson equation and the continuity equations into their equivalent circuits, which simplifies the mixed-level simulation. Finally, we apply these methods to the simulation of several semiconductor devices and verify their performance in the simulation and design of semiconductor devices.
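The idea of purging high-level fill-ins can be sketched as follows, assuming the standard level-of-fill rule used in ILU(p): original nonzeros have level 0, and a fill-in created from entries (i, k) and (k, j) gets level lev(i, k) + lev(k, j) + 1. Whether the thesis uses exactly this rule is an assumption; the function name `ilu_level` and the dense list-of-lists storage are illustrative only.

```python
INF = float("inf")

def ilu_level(a, max_level):
    """Incomplete LU keeping only entries whose fill level is <= max_level.

    Returns (lu, lev): packed factors (unit-lower L below the diagonal,
    U on and above it) and the level assigned to every position.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]
    # Original nonzeros and the diagonal start at level 0, zeros at infinity.
    lev = [[0 if (a[i][j] != 0 or i == j) else INF for j in range(n)]
           for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if lev[i][k] > max_level:
                continue  # this entry will be dropped, so it eliminates nothing
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
                lev[i][j] = min(lev[i][j], lev[i][k] + lev[k][j] + 1)
        # Purge every entry of row i whose level is too high.
        for j in range(n):
            if lev[i][j] > max_level:
                lu[i][j] = 0.0
    return lu, lev
```

With `max_level = 0` the factors keep the sparsity pattern of the input matrix; raising `max_level` admits more fill-ins and moves the result toward the complete LU factorization, at the cost of memory and time.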

17

Trojek, Lukáš. "Bezmaticové předpodmínění." Master's thesis, 2012. http://www.nusl.cz/ntk/nusl-305569.

Abstract:

This diploma thesis focuses on matrix-free preconditioning of linear systems. It gives a very brief introduction to iterative methods, preconditioning, and the matrix-free environment. The emphasis is on a detailed description of a variant of LU factorization that can be computed in a matrix-free manner, and on a new technique connected with this factorization for preconditioning by incomplete LU factors in a matrix-free environment. Its main features are the storage of only one of the two incomplete factors and low memory costs during the computation of the stored factor. The thesis closes with numerical experiments demonstrating the efficiency of the proposed technique.

18

Yu-Ming, Sun, and 孫郁明. "Levelized Incomplete LU Factorization and Its Application to Quasi-Static MOSFET C-V Simulation." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/30551558454138423399.

Abstract:

Master's thesis, National Central University, Graduate Institute of Electrical Engineering, academic year 89. In this thesis, we focus on the measurement of quasi-static MOSFET capacitance and on the Levelized Incomplete LU method. Direct LU decomposition is not suitable for large-scale device simulation, because many fill-ins are generated during the decomposition. The Levelized Incomplete LU method provides a good way to improve on traditional LU decomposition. A metal-gate structure and an interleaving variable permutation are used. The metal gate saves some of the rectangular grids otherwise needed to describe the poly gate terminal, and the interleaving method reduces the number of fill-ins. The main focus of the thesis is the measurement of the gate-to-substrate, gate-to-source, and gate-to-drain capacitances of the MOSFET. These measurements are implemented by the charge method, the sinusoidal method, and the ramp method, and the resulting C-V characteristics are compared. Finally, we try to reduce memory space by changing the data type, which proves quite difficult. We therefore propose a decoupled method to reduce the number of nonzero entries.
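To see why a permutation of the variables can change the number of fill-ins, the sketch below counts the fill created below the diagonal by symbolic elimination on a given sparsity pattern. It is a generic illustration, not the interleaving scheme of the thesis; the function name `count_fill` and the arrowhead test pattern are assumptions chosen for the example.

```python
def count_fill(pattern):
    """Count fill-ins created by LU elimination without pivoting.

    pattern[i][j] is truthy where the matrix has a structural nonzero.
    Only the sparsity structure is used; no numerical values are needed.
    """
    n = len(pattern)
    nz = [set(j for j in range(n) if pattern[i][j] or i == j)
          for i in range(n)]
    fill = 0
    for k in range(n):
        for i in range(k + 1, n):
            if k in nz[i]:
                # Eliminating (i, k) adds row k's trailing pattern to row i.
                for j in nz[k]:
                    if j > k and j not in nz[i]:
                        nz[i].add(j)
                        fill += 1
    return fill


def arrow(n, hub):
    """Arrowhead pattern: diagonal plus a dense row and column at index hub."""
    return [[i == j or i == hub or j == hub for j in range(n)]
            for i in range(n)]
```

For a 5-by-5 arrowhead pattern, `count_fill(arrow(5, 0))` reports that every off-diagonal zero of the trailing block fills in, while `count_fill(arrow(5, 4))`, the same matrix with the dense variable ordered last, creates no fill at all.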

19

"Improving the Execution Time of Large System Simulations." Master's thesis, 2012. http://hdl.handle.net/2286/R.I.15851.

Abstract:

Today, the electric power system faces new challenges from rapidly developing technology and growing concern about environmental problems. The future of the power system under these challenges needs to be planned and studied. However, because of the high computational complexity of the optimization problem, conducting a system planning study that takes into account the market structure and environmental constraints on a large-scale power system is computationally taxing. To improve the execution time of large system simulations, such as the system planning study, two possible strategies are proposed in this thesis. The first is to implement a relatively new factorization method, known as the multifrontal method, to speed up the solution of the sparse linear matrix equations within the large system simulations. The performance of the multifrontal method as implemented by UMFPACK is compared with traditional LU factorization on a wide range of power-system matrices. The results show that the multifrontal method is superior to traditional LU factorization on the relatively denser matrices found in other specialty areas, but performs poorly on the sparser matrices that occur in power-system applications. This result suggests that multifrontal methods may not be an effective way to improve execution time for large system simulations, and that power system engineers should evaluate the performance of the multifrontal method before applying it to their applications. The second strategy is to develop a small dc equivalent of the large-scale network with satisfactory accuracy for the large-scale system simulations. In this thesis, a modified Ward equivalent is generated for a large-scale power system, namely the full Electric Reliability Council of Texas (ERCOT) system. In this equivalent, all the generators in the full model are retained integrally. The accuracy of the modified Ward equivalent is validated, and the equivalent is used to conduct the optimal generation investment planning study. By using the dc equivalent, the execution time for optimal generation investment planning is greatly reduced. Different scenarios are modeled to study the impact of fuel prices, environmental constraints, and incentives for renewable energy on future investment and retirement in generation. Dissertation/Thesis, M.S., Electrical Engineering, 2012.
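The network-reduction step behind a classical Ward equivalent can be sketched as a Kron reduction: external buses are eliminated from the dc susceptance matrix one at a time, which is the same as taking a Schur complement. The sketch below shows only this generic step, under the assumption of a plain dc model; the thesis's modification (retaining all generators) is not reproduced, and the name `kron_reduce` is hypothetical.

```python
from fractions import Fraction

def kron_reduce(y, external):
    """Eliminate the buses listed in `external` from a square matrix y.

    Each elimination applies y[i][j] -= y[i][k] * y[k][j] / y[k][k]
    to all buses that are still active. Returns the reduced matrix
    over the retained buses, in their original order.
    """
    n = len(y)
    y = [[Fraction(v) for v in row] for row in y]  # exact arithmetic
    active = list(range(n))
    for k in external:
        active.remove(k)
        for i in active:
            for j in active:
                y[i][j] -= y[i][k] * y[k][j] / y[k][k]
    return [[y[i][j] for j in active] for i in active]
```

For a three-bus chain with two lines of susceptance 2 in series, eliminating the middle bus leaves a single equivalent line of susceptance 1 between the end buses, as expected for two equal elements in series.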

20

Ellis, Apollo Isaac Orion. "Jack Rabbit : an effective Cell BE programming system for high performance parallelism." Thesis, 2011. http://hdl.handle.net/2152/ETD-UT-2011-05-3624.

Abstract:

The Cell processor is an example of the trade-offs made when designing a mass-market, power-efficient multi-core machine, but its machine-exposing architecture and raw communication mechanisms are hard for a programmer to manage. Cell's simple hardware design pushes complexity into software, particularly in achieving low threading overhead, good bandwidth efficiency, and load balance. Several attempts have been made to produce efficient and effective programming systems for Cell, but they have been too specialized and thus fall short. We present Jack Rabbit, an efficient thread pool work queue implementation with load balancing mechanisms and double buffering. Our system incurs low threading overhead, achieves good load balance, and attains bandwidth efficiency. It represents a step towards an effective way to program Cell and any similar current or future processors.

21

Hsuan-Chu, Li, and 李宣助. "Studies on Generalized Vandermonde Matrices: Their Determinants, Inverses, Explicit LU Factorizations, with Applications." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/98917357188189258047.

Abstract:

Doctoral thesis, National Chengchi University, Graduate Institute of Applied Mathematics, academic year 95. Classical and generalized Vandermonde matrices are ubiquitous in mathematics, and many authors have recently studied their determinants, inverses, and explicit LU factorizations, together with applications. In this thesis we focus on two topics: generalized Vandermonde matrices revisited, and various decompositions of some generalized Vandermonde matrices. In the first topic, we prove the well-known determinant formulas for two types of generalized Vandermonde matrices using only mathematical induction, in contrast to the proofs of Fulin Qian and of Flowe and Harris. The second topic, which constitutes the main results of this thesis, is devoted to two themes. First, we study a special class, the transpose of the generalized Vandermonde matrix of the first type, and obtain its LU factorization in explicit form. Furthermore, we express the LU factorization as a product of 1-banded factors and obtain the inverse explicitly. Second, we consider a totally positive (TP) generalized Vandermonde matrix and obtain its unique LU factorization without using Schur functions, which improves on the result of Demmel and Koev, which involves Schur functions. As by-products, we obtain the determinant and the inverse of the matrix in question and express any Schur function in explicit form. Based on this result, we obtain a way to calculate Kostka numbers by expanding Schur functions.
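For the classical (not generalized) Vandermonde matrix, the link between the LU factorization and the determinant can be checked with a short exact-arithmetic sketch. It assumes distinct nodes, the convention V[i][j] = x_i^j, and LU without pivoting; the function names are illustrative and the explicit formulas of the thesis are not reproduced.

```python
from fractions import Fraction

def vandermonde(nodes):
    """Classical Vandermonde matrix with entries V[i][j] = nodes[i] ** j."""
    return [[Fraction(x) ** j for j in range(len(nodes))] for x in nodes]

def lu_no_pivot(a):
    """Doolittle LU (unit-lower L) in exact rational arithmetic."""
    n = len(a)
    u = [[Fraction(x) for x in row] for row in a]
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        if u[k][k] == 0:
            raise ZeroDivisionError("zero pivot; LU without pivoting fails")
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u

def det_from_lu(u):
    """Determinant as the product of the diagonal of U."""
    d = Fraction(1)
    for k in range(len(u)):
        d *= u[k][k]
    return d
```

For the nodes 1, 2, 3, 4 the diagonal of U is 1, 1, 2, 6, each entry being the product of the differences between a node and all earlier nodes, so the determinant is 12, in agreement with the product formula over all pairs of nodes.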

22

Hwang, Tsung-Min, and 黃聰明. "Rank Revealing QR, LU Factorizations and Relaxing Path Following Algorithm for Linear and Convex Quadratic Programming." Thesis, 1994. http://ndltd.ncl.edu.tw/handle/98753608933588694027.
