To see the other types of publications on this topic, follow the link: Massively Parallel Processing (MPP).

Dissertations / Theses on the topic 'Massively Parallel Processing (MPP)'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 33 dissertations / theses for your research on the topic 'Massively Parallel Processing (MPP).'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Kumm, Holger Thomas. "Methodologies for the synthesis of cost-effective modular-MPC configurations for image processing applications." Thesis, Brunel University, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.296194.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Ervin, Brian. "Neural Spike Detection and Classification Using Massively Parallel Graphics Processing." University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1377868773.

3

Nordström, Tomas. "Designing and using massively parallel computers for artificial neural networks." Licentiate thesis, Luleå tekniska universitet, Signaler och system, 1991. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-17900.

4

Hymel, Shawn. "Massively Parallel Hidden Markov Models for Wireless Applications." Thesis, Virginia Tech, 2011. http://hdl.handle.net/10919/36017.

Abstract:
Cognitive radio is a growing field in communications that allows a radio to automatically configure its transmission or reception properties in order to reduce interference, provide better quality of service, or allow for more users in a given spectrum. Such processes require several complex features that are currently being utilized in cognitive radio. Two such features, spectrum sensing and identification, have been implemented in numerous ways; however, they generally suffer from high computational complexity. Additionally, Hidden Markov Models (HMMs) are a widely used mathematical modeling tool in various fields of engineering and science. In electrical and computer engineering, they are used in several areas, including speech recognition, handwriting recognition, artificial intelligence, queuing theory, and the modeling of fading in communication channels. The research presented in this thesis proposes a new approach to spectrum identification using a parallel implementation of Hidden Markov Models. Algorithms involving HMMs are usually implemented in the traditional serial manner, which can have prohibitively long runtimes. In this work, we study their use in parallel implementations and compare our approach to traditional serial implementations. Timing and power measurements are taken and used to show that the parallel implementation can achieve well over 100× speedup in certain situations. To demonstrate the utility of this new parallel algorithm using graphics processing units (GPUs), a new method for signal identification is proposed for both serial and parallel implementations using HMMs. The method achieved high recognition at -10 dB Eb/N0. HMMs can benefit from parallel implementation in certain circumstances, specifically in models that have many states or when multiple models are used in conjunction.
Master of Science
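The parallelism exploited above comes from the structure of the standard HMM forward algorithm: at each time step, every state's probability can be updated independently of the others. A minimal serial sketch (with toy parameters, not taken from the thesis) highlighting the parallelizable inner loop:

```python
# Hedged sketch: forward algorithm for a toy 2-state HMM.
# The comprehension over states j is independent across j, which is the
# data parallelism a GPU implementation would exploit at each time step.

def forward(obs, pi, A, B):
    """Return P(obs | model) for an HMM with transitions A, emissions B."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]          # each j computable in parallel
    return sum(alpha)

# Toy example (hypothetical parameters):
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
p = forward([0, 1, 0], pi, A, B)
```

A GPU version would map the loop over states to one thread per state, which is where the speedups reported for models with many states would originate.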
5

Savaş, Süleyman. "Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture." Thesis, Halmstad University, School of Information Science, Computer and Electrical Engineering (IDE), 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-2192.

Abstract:
This thesis presents deliberations on the implementation of the Gentleman-Kung systolic array for QR decomposition using Givens rotations within the context of radar signal processing. The systolic array of Givens rotations is implemented and analysed using a massively parallel processor array (MPPA), the Ambric Am2045. The tools dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on the Eclipse environment and was produced for the Ambric chip family, is used for programming, simulation, and performance analysis. Two parallel matrix multiplications were implemented to become familiar with the architecture and tools. Moreover, systolic arrays of different sizes are implemented and compared with each other. For programming, the aJava and aStruct languages are provided; however, floating-point numbers are not supported by these languages. Thus fixed-point arithmetic is used in the systolic array implementation of Givens rotations. Stable and precise numerical results are obtained as outputs of the algorithms; however, the analysis results are not reliable because of the performance analysis tools.
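For context, a Givens rotation zeroes one subdiagonal entry at a time while preserving norms, which is the operation the Gentleman-Kung array pipelines across its cells. A plain serial, floating-point sketch of QR via Givens rotations (the 3×3 matrix is illustrative, not from the thesis):

```python
import math

def givens(a, b):
    """Rotation (c, s) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def qr_givens(M):
    """Reduce M to upper-triangular R by a sequence of Givens rotations."""
    R = [row[:] for row in M]
    n = len(R)
    for col in range(n):
        for row in range(n - 1, col, -1):       # zero subdiagonal, bottom up
            c, s = givens(R[row - 1][col], R[row][col])
            for k in range(col, n):             # rotate the two affected rows
                x, y = R[row - 1][k], R[row][k]
                R[row - 1][k] = c * x + s * y
                R[row][k] = -s * x + c * y
    return R

R = qr_givens([[6.0, 5.0, 0.0], [5.0, 1.0, 4.0], [0.0, 4.0, 3.0]])
```

In the systolic formulation, the rotation computation and the row updates are distributed over boundary and internal cells so that successive matrix rows stream through the array.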
6

Savaş, Süleyman. "Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture." Thesis, Halmstad University, School of Information Science, Computer and Electrical Engineering (IDE), 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-4137.

Abstract:
This thesis presents deliberations on the implementation of the Gentleman-Kung systolic array for QR decomposition using Givens rotations within the context of radar signal processing. The systolic array of Givens rotations is implemented and analysed using a massively parallel processor array (MPPA), the Ambric Am2045. The tools dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on the Eclipse environment and was produced for the Ambric chip family, is used for programming, simulation, and performance analysis. Two parallel matrix multiplications were implemented to become familiar with the architecture and tools. Moreover, systolic arrays of different sizes are implemented and compared with each other. For programming, the aJava and aStruct languages are provided; however, floating-point numbers are not supported by these languages. Thus fixed-point arithmetic is used in the systolic array implementation of Givens rotations. Stable and precise numerical results are obtained as outputs of the algorithms; however, the analysis results are not reliable because of the performance analysis tools.
7

Ediger, David. "Analyzing hybrid architectures for massively parallel graph analysis." Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/47659.

Abstract:
The quantity of rich, semi-structured data generated by sensor networks, scientific simulation, business activity, and the Internet grows daily. The objective of this research is to investigate architectural requirements for emerging applications in massive graph analysis. Using emerging hybrid systems, we will map applications to architectures and close the loop between software and hardware design in this application space. Parallel algorithms and specialized machine architectures are necessary to handle the immense size and rate of change of today's graph data. To highlight the impact of this work, we describe a number of relevant application areas ranging from biology to business and cybersecurity. With several proposed architectures for massively parallel graph analysis, we investigate the interplay of hardware, algorithm, data, and programming model through real-world experiments and simulations. We demonstrate techniques for obtaining parallel scaling on multithreaded systems using graph algorithms that are orders of magnitude faster and larger than the state of the art. The outcome of this work is a proposed hybrid architecture for massive-scale analytics that leverages key aspects of data-parallel and highly multithreaded systems. In simulations, the hybrid systems incorporating a mix of multithreaded, shared memory systems and solid state disks performed up to twice as fast as either homogeneous system alone on graphs with as many as 18 trillion edges.
8

Walsh, Declan. "Design and implementation of massively parallel fine-grained processor arrays." Thesis, University of Manchester, 2015. https://www.research.manchester.ac.uk/portal/en/theses/design-and-implementation-of-massively-parallel-finegrained-processor-arrays(e0e03bd5-4feb-4d66-8d4b-0e057684e498).html.

Abstract:
This thesis investigates the use of massively parallel fine-grained processor arrays to increase computational performance. As processors move towards multi-core processing, more energy-efficient processors can be designed by increasing the number of processor cores on a single chip rather than increasing the clock frequency of a single processor. This can be done by making each core less complex while increasing the number of cores on a chip. Using this philosophy, a processor core can be reduced in complexity, area, and speed to form a very small processor which can still perform basic arithmetic operations. Due to the small area occupation, this core can be replicated and scaled to form a large parallel processor array offering significant performance. Following this design methodology, two fine-grained parallel processor arrays are designed which aim to achieve a small area occupation for each individual processor so that a larger array can be implemented over a given area. To demonstrate scalability and performance, a SIMD parallel processor array is designed for implementation on an FPGA, where each processor can be implemented using four 'slices' of a Xilinx FPGA. With such small area utilization, a large fine-grained processor array can be implemented on these FPGAs. A 32 × 32 processor array is implemented and fast processing is demonstrated using image processing tasks. An event-driven MIMD parallel processor array is also designed, which occupies a small amount of area and can be scaled up to form much larger arrays. The event-driven approach allows a processor to enter an idle mode when no events are occurring local to it, reducing power consumption, and to switch back to operational mode when events are detected. The processor core is designed with a multi-bit datapath and ALU and contains its own instruction memory, making the array a multi-core processor array.
With area occupation of primary concern, the processor is relatively simple and connects to its four nearest neighbours. A small 8 × 8 prototype chip is implemented in a 65 nm CMOS technology process; it operates at a clock frequency of 80 MHz and offers a peak performance of 5.12 GOPS, which can be scaled up with larger arrays. An application of the event-driven processor array is demonstrated using a simulation model of the processor: an event-driven algorithm performs distributed control of a distributed-manipulator simulator, separating objects based on their physical properties.
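The SIMD whole-array style described above can be mimicked in software: every array element plays the role of one processing element holding a single pixel, and communication is a shift toward one of the four nearest neighbours. A hypothetical sketch (not the thesis hardware) using a 4-neighbour Laplacian, a typical image processing test task:

```python
# Hedged sketch: emulating a fine-grained SIMD processor array.
# shift() models 4-connected PE communication; border PEs read `pad`.

def shift(img, dr, dc, pad=0):
    """Return the image shifted by (dr, dc); out-of-array reads yield `pad`."""
    h, w = len(img), len(img[0])
    return [[img[r - dr][c - dc] if 0 <= r - dr < h and 0 <= c - dc < w else pad
             for c in range(w)] for r in range(h)]

def laplacian(img):
    """4-neighbour Laplacian, computed as a few SIMD-style whole-array steps."""
    n = shift(img, 1, 0); s = shift(img, -1, 0)
    e = shift(img, 0, 1); w_ = shift(img, 0, -1)
    return [[n[r][c] + s[r][c] + e[r][c] + w_[r][c] - 4 * img[r][c]
             for c in range(len(img[0]))] for r in range(len(img))]

img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
out = laplacian(img)
```

On the real array, each whole-image expression above would be a single instruction broadcast to all processing elements at once.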
9

Joseph, Rosh John. "Investigating the user-acceptability of a massively parallel computing solution for image processing workstations." Thesis, Brunel University, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.242981.

10

Chaudhari, Gunavant Dinkar. "Simulation and emulation of massively parallel processor for solving constraint satisfaction problems based on oracles." PDXScholar, 2011. https://pdxscholar.library.pdx.edu/open_access_etds/11.

Abstract:
Most of this thesis is devoted to efficient automated logic synthesis of oracle processors. These oracle processors are of interest to several modern technologies, including scheduling and allocation, image processing and robot vision, computer-aided design, games and puzzles, and cellular automata, but so far the most important practical application is building logic circuits to solve various practical constraint satisfaction problems in intelligent robotics. For instance, robot path planning can be reduced to satisfiability. In short, an oracle is a circuit that takes some proposed solution on its inputs and answers yes/no to this proposition. In other words, it is a predicate, or a concept-checking machine. Oracles have many applications in AI and theoretical computer science, but so far they have not been used much in hardware architectures. Systematic logic synthesis methodologies for oracle circuits have not previously been the subject of dedicated research. It is not known how large an advantage these processors will bring when compared to parallel processing with CUDA/GPU processors or standard PC processing. My interest in this thesis is only in the architectural and logic synthesis aspects, not in the physical (technological) design aspects of these circuits. In the future, these circuits may be realized using reversible, nano, and other new technologies, but this thesis is not concerned with future realization technologies. We want simply to answer the following question: is there any speed advantage of the new oracle-based architectures when compared with standard serial or parallel processors?
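The oracle idea can be illustrated in software. In the sketch below (a hypothetical example, not from the thesis), the oracle is a pure predicate that accepts or rejects a proposed solution to a small constraint satisfaction problem, here 2-colouring of a 4-cycle graph; an oracle processor would evaluate many candidate inputs concurrently rather than in this serial loop:

```python
from itertools import product

# Hedged sketch: an "oracle" answers yes/no to a proposed solution.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle (assumed example)

def oracle(assign):
    """yes/no: is `assign` a proper 2-colouring of the graph?"""
    return all(assign[u] != assign[v] for u, v in edges)

# Exhaustive generation of candidate inputs, each checked by the oracle;
# in hardware, many oracle copies (or one pipelined oracle) run in parallel.
solutions = [a for a in product((0, 1), repeat=4) if oracle(a)]
```

The speed question posed in the abstract is essentially whether a hardware oracle evaluating candidates in parallel beats this kind of serial enumeration on conventional processors.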
11

Verdú, Mulà Javier. "Analysis and architectural support for parallel stateful packet processing." Doctoral thesis, Universitat Politècnica de Catalunya, 2008. http://hdl.handle.net/10803/6027.

Abstract:
The evolution of network services is closely related to the network technology trend. Originally, network nodes forwarded packets from a source to a destination in the network by executing lightweight packet processing, or even negligible workloads. As links provide more complex services, packet processing demands the execution of more computationally intensive applications. Complex network applications deal with both the packet header and payload (i.e. packet contents) to provide upper-layer network services, such as enhanced security, system utilization policies, and video-on-demand management.

Applications that provide complex network services exhibit two key capabilities that differ from lower-layer network applications: a) deep packet inspection examines the packet payload, typically searching for a matching string or regular expression, and b) stateful processing keeps track of information from previous packet processing, unlike other applications that keep no data about other packets. In most cases, deep packet inspection also integrates stateful processing.

Computer architecture research aims to maximize system throughput to sustain the required network processing performance as well as other demands, such as memory and I/O bandwidth. In fact, there are different processor architectures depending on the degree to which hardware resources are shared among streams (i.e. hardware contexts). Multicore architectures present multiple processing engines within a single chip that share cache levels of the memory hierarchy and the interconnection network. Multithreaded architectures integrate multiple streams in a single processing engine, sharing functional units, the register file, the fetch unit, and inner levels of the cache hierarchy. Scalable multicore multithreaded architectures emerge as a solution to meet the requirements of high-throughput systems.

We use the term massively multithreaded architectures for architectures that comprise tens to hundreds of streams distributed across multiple cores on a chip. Nevertheless, the efficient utilization of these architectures depends on the application characteristics. On one hand, emerging network applications show large computational workloads with significant variations in packet processing behavior. It is therefore important to analyze the behavior of each packet's processing in order to optimally assign packets to threads (i.e. software contexts) and reduce any negative interaction among them. On the other hand, network applications present Packet Level Parallelism (PLP), in which several packets can be processed in parallel. As in other paradigms, dependencies among packets limit the amount of PLP. Lower network layer applications show negligible packet dependencies. In contrast, complex upper-layer network applications show dependencies among packets that reduce the amount of PLP.

In this thesis, we address the limitations of parallelism in stateful network applications to maximize the throughput of advanced network devices. This dissertation comprises three complementary sets of contributions focused on network analysis, workload characterization, and an architectural proposal.

The network analysis evaluates the impact of network traffic on stateful network applications. We especially study the impact of network traffic aggregation on memory hierarchy performance. We categorize and characterize network applications according to their data management. The results point out that stateful processing presents reduced instruction-level parallelism and a high rate of long-latency memory accesses. Our analysis reveals that stateful applications expose a variety of levels of parallelism related to stateful data categories. Thus, we propose MultiLayer Processing (MLP) as an execution model to exploit multiple levels of parallelism. MLP is a thread-migration-based mechanism that increases the synergy among streams in the memory hierarchy and alleviates contention in the critical sections of parallel stateful workloads.
12

Warneke, Daniel [Verfasser], and Odej [Akademischer Betreuer] Kao. "Massively Parallel Data Processing on Infrastructure as a Service Platforms / Daniel Warneke. Betreuer: Odej Kao." Berlin : Universitätsbibliothek der Technischen Universität Berlin, 2011. http://d-nb.info/1016533292/34.

13

Cruz-Rivera, Jose L. "Elements of an applications-driven optical interconnect technology modeling framework for ultracompact massively parallel processing systems." Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/15688.

14

Slotta, Douglas J. "Structural Design Using Cellular Automata." Thesis, Virginia Tech, 2001. http://hdl.handle.net/10919/33368.

Abstract:
Traditional parallel methods for structural design do not scale well. This thesis discusses the application of massively scalable cellular automata (CA) techniques to structural design. There are two sets of CA rules: one used to propagate stresses and strains, and one to perform design analysis. These rules can be applied serially, periodically, or concurrently, and Jacobi- or Gauss-Seidel-style updating can be used. These options are compared with respect to convergence, speed, and stability.
Master of Science
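The two updating schemes mentioned can be contrasted on the simplest relaxation problem. In this hedged sketch (a 1D Laplace/diffusion rule with fixed boundaries, not the thesis's structural rules), the Jacobi step reads only the previous generation, so every cell can be updated in parallel, while Gauss-Seidel reuses freshly computed neighbours and is inherently more sequential:

```python
def jacobi_step(u):
    """One CA sweep using only the previous generation (Jacobi update)."""
    return ([u[0]]
            + [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)]
            + [u[-1]])

def gauss_seidel_step(u):
    """One sweep reusing freshly updated neighbours (Gauss-Seidel update)."""
    u = u[:]
    for i in range(1, len(u) - 1):
        u[i] = (u[i - 1] + u[i + 1]) / 2
    return u

# Fixed boundary values 0 and 1; both schemes relax to a linear profile.
u = [0.0, 0.0, 0.0, 0.0, 1.0]
v = [0.0, 0.0, 0.0, 0.0, 1.0]
for _ in range(200):
    u = jacobi_step(u)
    v = gauss_seidel_step(v)
```

Gauss-Seidel typically converges in fewer sweeps, but Jacobi's independence across cells is what maps directly onto a massively parallel CA array.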
15

Lacy, William Stephen. "Design issues for interconnection networks in massively parallel processing systems under advanced VLSI and packaging constraints." Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/14690.

16

Alagoda, Geoffrey N. "VLSI implementation of a massively parallel wavelet based zerotree coder for the intelligent pixel array." Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2001. https://ro.ecu.edu.au/theses/1078.

Abstract:
In the span of a few years, mobile multimedia communication has rapidly become a significant area of research and development, constantly challenging boundaries on a variety of technological fronts. Mobile video communications in particular encompasses a number of technical hurdles that generally steer technological advancements towards devices that are low in complexity and power usage yet perform the given task efficiently. Devices of this nature have been made possible through the use of massively parallel processing arrays such as the Intelligent Pixel Processing Array. The Intelligent Pixel Processing Array is a novel concept that integrates a parallel image capture mechanism, a parallel processing component, and a parallel display component into a single-chip solution geared toward mobile communications environments, be it a PDA-based system or the video-communicator wristwatch portrayed in "Dick Tracy" episodes. This thesis details work performed to provide an efficient, low-power, low-complexity solution surrounding the massively parallel implementation of a zerotree entropy codec for the Intelligent Pixel Array.
17

Obrecht, Christian. "High performance lattice Boltzmann solvers on massively parallel architectures with applications to building aeraulics." Phd thesis, INSA de Lyon, 2012. http://tel.archives-ouvertes.fr/tel-00776986.

Abstract:
With the advent of low-energy buildings, the need for accurate building performance simulations has significantly increased. However, for the time being, the thermo-aeraulic effects are often taken into account through simplified or even empirical models, which fail to provide the expected accuracy. Resorting to computational fluid dynamics seems therefore unavoidable, but the required computational effort is in general prohibitive. The joint use of innovative approaches such as the lattice Boltzmann method (LBM) and massively parallel computing devices such as graphics processing units (GPUs) could help to overcome these limits. The present research work is devoted to explore the potential of such a strategy. The lattice Boltzmann method, which is based on a discretised version of the Boltzmann equation, is an explicit approach offering numerous attractive features: accuracy, stability, ability to handle complex geometries, etc. It is therefore an interesting alternative to the direct solving of the Navier-Stokes equations using classic numerical analysis. From an algorithmic standpoint, the LBM is well-suited for parallel implementations. The use of graphics processors to perform general purpose computations is increasingly widespread in high performance computing. These massively parallel circuits provide up to now unrivalled performance at a rather moderate cost. Yet, due to numerous hardware induced constraints, GPU programming is quite complex and the possible benefits in performance depend strongly on the algorithmic nature of the targeted application. For LBM, GPU implementations currently provide performance two orders of magnitude higher than a weakly optimised sequential CPU implementation. The present thesis consists of a collection of nine articles published in international journals and proceedings of international conferences (the last one being under review). 
These contributions address the issues related to single-GPU implementations of the LBM and the optimisation of memory accesses, as well as multi-GPU implementations and the modelling of inter-GPU and internode communication. In addition, we outline several extensions to the LBM, which appear essential to perform actual building thermo-aeraulic simulations. The test cases we used to validate our codes account for the strong potential of GPU LBM solvers in practice.
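The collide-and-stream structure that makes the lattice Boltzmann method GPU-friendly is entirely local: each site relaxes its populations toward a local equilibrium and then pushes them to neighbouring sites. A deliberately minimal 1D three-velocity sketch (illustrative only; solvers such as those in the thesis use 3D lattices and flow-capable equilibria):

```python
# Hedged sketch: D1Q3 diffusion-type LBM on a periodic 1D lattice.
W = (4 / 6, 1 / 6, 1 / 6)    # weights for rest, +x, -x populations (assumed)
TAU = 1.0                    # relaxation time (assumed)

def step(f):
    """One LBM update: local collision, then streaming to neighbours."""
    n = len(f[0])
    rho = [f[0][x] + f[1][x] + f[2][x] for x in range(n)]
    # Collision: relax every population toward its local equilibrium W[i]*rho.
    post = [[f[i][x] + (W[i] * rho[x] - f[i][x]) / TAU for x in range(n)]
            for i in range(3)]
    # Streaming: moving populations hop one site (periodic boundaries).
    return [post[0],
            [post[1][(x - 1) % n] for x in range(n)],
            [post[2][(x + 1) % n] for x in range(n)]]

n = 16
rho0 = [1.0] * n
rho0[8] = 2.0                # density bump that should diffuse away
f = [[W[i] * rho0[x] for x in range(n)] for i in range(3)]
for _ in range(100):
    f = step(f)
rho = [sum(f[i][x] for i in range(3)) for x in range(n)]
```

Because both phases touch only a site and its immediate neighbours, each lattice site can be assigned to one GPU thread, which is the locality property the thesis's optimised memory-access schemes exploit.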
18

Lohrmann, Björn [Verfasser], Odej [Akademischer Betreuer] Kao, Odej [Gutachter] Kao, Johann-Christoph [Gutachter] Freytag, and Kai-Uwe [Gutachter] Sattler. "Massively parallel stream processing with latency guarantees / Björn Lohrmann ; Gutachter: Odej Kao, Johann-Christoph Freytag, Kai-Uwe Sattler ; Betreuer: Odej Kao." Berlin : Technische Universität Berlin, 2016. http://d-nb.info/1156181100/34.

19

Edwards, Thomas David. "Optimising a fluid plasma turbulence simulation on modern high performance computers." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/4681.

Abstract:
Nuclear fusion offers the potential of almost limitless energy from sea water and lithium without the dangers of carbon emissions or long term radioactive waste. At the forefront of fusion technology are the tokamaks, toroidal magnetic confinement devices that contain miniature stars on Earth. Nuclei can only fuse by overcoming the strong electrostatic forces between them which requires high temperatures and pressures. The temperatures in a tokamak are so great that the Deuterium-Tritium fusion fuel forms a plasma which must be kept hot and under pressure to maintain the fusion reaction. Turbulence in the plasma causes disruption by transporting mass and energy away from this core, reducing the efficiency of the reaction. Understanding and controlling the mechanisms of plasma turbulence is key to building a fusion reactor capable of producing sustained output. The extreme temperatures make detailed empirical observations difficult to acquire, so numerical simulations are used as an additional method of investigation. One numerical model used to study turbulence and diffusion is CENTORI, a direct two-fluid magneto-hydrodynamic simulation of a tokamak plasma developed by the Culham Centre for Fusion Energy (CCFE formerly UKAEA:Fusion). It simulates the entire tokamak plasma with realistic geometry, evolving bulk plasma quantities like pressure, density and temperature through millions of timesteps. This requires CENTORI to run in parallel on a Massively Parallel Processing (MPP) supercomputer to produce results in an acceptable time. Any improvements in CENTORI’s performance increases the rate and/or total number of results that can be obtained from access to supercomputer resources. This thesis presents the substantial effort to optimise CENTORI on the current generation of academic supercomputers. 
It investigates and reviews the properties of contemporary computer architectures, then proposes, implements and executes a benchmark suite of CENTORI's fundamental kernels. The suite is used to compare the performance of three competing memory layouts of the primary vector data structure using a selection of compilers on a variety of computer architectures. The results show there is no optimal memory layout on all platforms, so a flexible optimisation strategy was adopted to pursue "portable" optimisation, i.e. optimisations that can easily be added, adapted or removed from future platforms depending on their performance. This required designing an interface to functions and datatypes that separates CENTORI's fundamental algorithms from repetitive, low-level implementation details. This approach offered multiple benefits: the clearer representation of CENTORI's core equations as mathematical expressions in Fortran source code allows rapid prototyping and development of new features; the reduction in the total data volume by a factor of three reduces the amount of data transferred over the memory bus to almost a third; and the reduction in the number of intense floating point kernels reduces the effort of optimising the application on new platforms. The project proceeds to rewrite CENTORI using the new Application Programming Interface (API) and evaluates two optimised implementations. The first is a traditional library implementation that uses hand-optimised subroutines to implement the library functions. The second uses a dynamic optimisation engine to perform automatic stripmining to improve the performance of the memory hierarchy. The automatic stripmining implementation uses lazy evaluation to delay calculations until absolutely necessary, allowing it to identify temporary data structures and minimise them for optimal cache use.
This novel technique is combined with highly optimised implementations of the kernel operations and optimised parallel communication routines to produce a significant improvement in CENTORI’s performance. The maximum measured speed up of the optimised versions over the original code was 3.4 times on 128 processors on HPCx, 2.8 times on 1024 processors on HECToR and 2.3 times on 256 processors on HPC-FF.
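Stripmining as used here can be sketched abstractly: a long elementwise loop is split into short strips so that a strip-local temporary stays cache-resident across a chain of fused operations, instead of a full-length temporary array making two trips over the memory bus. The names below are illustrative, not CENTORI's API:

```python
# Hedged sketch of stripmining (loop tiling) for a fused vector expression.
STRIP = 4   # strip length standing in for a cache-friendly block size

def fused_axpy_scale(a, x, y, s):
    """Compute s * (a*x + y) elementwise, one strip at a time."""
    out = [0.0] * len(x)
    for start in range(0, len(x), STRIP):
        stop = min(start + STRIP, len(x))
        # Strip-local temporary: only STRIP elements live at once, so the
        # intermediate a*x + y never round-trips through main memory.
        tmp = [a * x[i] + y[i] for i in range(start, stop)]
        for j, i in enumerate(range(start, stop)):
            out[i] = s * tmp[j]
    return out

res = fused_axpy_scale(2.0, [1.0] * 10, [1.0] * 10, 0.5)
```

A lazy-evaluation engine, as described above, can discover such fusable chains automatically and choose the strip length per platform.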
20

Lezar, Evan. "GPU acceleration of matrix-based methods in computational electromagnetics." Thesis, Stellenbosch : University of Stellenbosch, 2011. http://hdl.handle.net/10019.1/6507.

Abstract:
Thesis (PhD (Electrical and Electronic Engineering))--University of Stellenbosch, 2011.<br>ENGLISH ABSTRACT: This work considers the acceleration of matrix-based computational electromagnetic (CEM) techniques using graphics processing units (GPUs). These massively parallel processors have gained much support since late 2006, with software tools such as CUDA and OpenCL greatly simplifying the process of harnessing the computational power of these devices. As with any advances in computation, the use of these devices enables the modelling of more complex problems, which in turn should give rise to better solutions to a number of global challenges faced at present. For the purpose of this dissertation, CUDA is used in an investigation of the acceleration of two methods in CEM that are used to tackle a variety of problems. The first of these is the Method of Moments (MOM) which is typically used to model radiation and scattering problems, with the latter begin considered here. For the CUDA acceleration of the MOM presented here, the assembly and subsequent solution of the matrix equation associated with the method are considered. This is done for both single and double precision oating point matrices. For the solution of the matrix equation, general dense linear algebra techniques are used, which allow for the use of a vast expanse of existing knowledge on the subject. This also means that implementations developed here along with the results presented are immediately applicable to the same wide array of applications where these methods are employed. Both the assembly and solution of the matrix equation implementations presented result in signi cant speedups over multi-core CPU implementations, with speedups of up to 300x and 10x, respectively, being measured. 
The implementations presented also overcome one of the major limitations in the use of GPUs as accelerators (that of limited memory capacity) with problems up to 16 times larger than would normally be possible being solved. The second matrix-based technique considered is the Finite Element Method (FEM), which allows for the accurate modelling of complex geometric structures including non-uniform dielectric and magnetic properties of materials, and is particularly well suited to handling bounded structures such as waveguide. In this work the CUDA acceleration of the cutoff and dispersion analysis of three waveguide configurations is presented. The modelling of these problems using an open-source software package, FEniCS, is also discussed. Once again, the problem can be approached from a linear algebra perspective, with the formulation in this case resulting in a generalised eigenvalue (GEV) problem. For the problems considered, a total solution speedup of up to 7x is measured for the solution of the generalised eigenvalue problem, with up to 22x being attained for the solution of the standard eigenvalue problem that forms part of the GEV problem.<br>AFRIKAANSE OPSOMMING: In hierdie werkstuk word die versnelling van matriksmetodes in numeriese elektromagnetika (NEM) deur die gebruik van grafiese verwerkingseenhede (GVEe) oorweeg. Die gebruik van hierdie verwerkingseenhede is aansienlik vergemaklik in 2006 deur sagteware pakette soos CUDA en OpenCL. Hierdie toestelle, soos ander verbeterings in verwerkings vermoe, maak dit moontlik om meer komplekse probleme op te los. Hierdie stel wetenskaplikes weer in staat om globale uitdagings beter aan te pak. In hierdie proefskrif word CUDA gebruik om ondersoek in te stel na die versnelling van twee metodes in NEM, naamlik die Moment Metode (MOM) en die Eindige Element Metode (EEM). Die MOM word tipies gebruik om stralings- en weerkaatsingsprobleme op te los. Hier word slegs na die weerkaatsingsprobleme gekyk. 
CUDA word gebruik om die opstel van die MOM matriks en ook die daaropvolgende oplossing van die matriksvergelyking wat met die metode gepaard gaan te bespoedig. Algemene digte lineêre algebra tegnieke word benut om die matriksvergelykings op te los. Dit stel die magdom bestaande kennis in die vakgebied beskikbaar vir die oplossing, en gee ook aanleiding daartoe dat enige implementasies wat ontwikkel word en resultate wat verkry word ook betrekking het tot 'n wye verskeidenheid probleme wat die lineêre algebra metodes gebruik. Daar is gevind dat beide die opstelling van die matriks en die oplossing van die matriksvergelyking aansienlik vinniger is as veelverwerker SVE implementasies. 'n Versnelling van tot 300x en 10x onderskeidelik is gemeet vir die opstel en oplos fases. Die hoeveelheid geheue beskikbaar tot die GVE is een van die belangrike beperkinge vir die gebruik van GVEe vir groot probleme. Hierdie beperking word hierin oorkom en probleme wat selfs 16 keer groter is as die GVE se beskikbare geheue word geakkommodeer en suksesvol opgelos. Die Eindige Element Metode word op sy beurt gebruik om komplekse geometrieë asook nie-uniforme materiaaleienskappe te modelleer. Die EEM is ook baie geskik om begrensde strukture soos golfgeleiers te hanteer. Hier word CUDA gebruik om die afsny- en dispersieanalise van drie golfgeleierkonfigurasies te versnel. Die implementasie van hierdie probleme word gedoen deur 'n versameling oopbronkode wat bekend staan as FEniCS, wat ook hierin bespreek word. Die probleme wat ontstaan in die EEM kan weereens vanaf 'n lineêre algebra uitgangspunt benader word. In hierdie geval lei die formulering tot 'n algemene eiewaardeprobleem. Vir die golfgeleierprobleme wat ondersoek word is gevind dat die algemene eiewaardeprobleem met tot 7x versnel word. Die standaard eiewaardeprobleem wat 'n stap is in die oplossing van die algemene eiewaardeprobleem is met tot 22x versnel.
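The reduction the abstract alludes to, turning a generalised eigenvalue problem A x = λ B x into a standard eigenvalue problem, can be sketched with NumPy. This is a toy illustration with random matrices (not code from the thesis); the Cholesky-based reduction shown is one standard approach when B is symmetric positive definite:

```python
import numpy as np

# Hypothetical small FEM-style matrices: A symmetric ("stiffness"-like),
# B symmetric positive definite ("mass"-like).
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M + M.T
B = 2.0 * np.eye(n) + 0.1 * np.ones((n, n))

# Reduce A x = lam B x to a standard problem: with B = L L^T,
# solve (L^-1 A L^-T) y = lam y, then recover x = L^-T y.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
C = Linv @ A @ Linv.T            # standard symmetric eigenproblem
lam, Y = np.linalg.eigh(C)
X = np.linalg.solve(L.T, Y)      # back-transform the eigenvectors

# Verify the generalised residual A x - lam B x for each pair.
for k in range(n):
    r = A @ X[:, k] - lam[k] * (B @ X[:, k])
    assert np.linalg.norm(r) < 1e-8
```

The same structure explains the two speedup figures quoted: the standard eigensolve (the `eigh` step) is only one part of the full GEV solve, which also includes the factorisation and back-transformation.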
APA, Harvard, Vancouver, ISO, and other styles
21

Mašek, Jan. "Automatické strojové metody získávání znalostí z multimediálních dat." Doctoral thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2016. http://www.nusl.cz/ntk/nusl-256538.

Full text
Abstract:
The quality and efficient processing of an increasing amount of multimedia data is nowadays becoming increasingly needed to obtain some knowledge from this data. The thesis deals with the research, implementation, optimization and experimental verification of automatic machine learning methods for multimedia data analysis. The created approach achieves higher accuracy in comparison with common methods when applied on selected examples. Selected results were published in journals with impact factor [1, 2]. For these reasons, special parallel computing methods were created in this work. These methods use massively parallel hardware to save electric energy and computing time and to achieve better results while solving problems. Computations which usually take days can be computed in minutes using the new optimized methods. The functionality of the created methods was verified on selected problems: artery detection from ultrasound images with further classification of artery disease, building detection from aerial images for obtaining geographical coordinates, the detection of materials contained in a meteorite from CT images, the processing of huge databases of structured data, the classification of metallurgical materials using laser-induced breakdown spectroscopy, and the automatic classification of emotions from texts.
APA, Harvard, Vancouver, ISO, and other styles
22

Benosman, Ridha Mohammed. "Conception et évaluation de performance d'un Bus applicatif, massivement parallèle et orienté service." Thesis, Paris, CNAM, 2013. http://www.theses.fr/2013CNAM0889/document.

Full text
Abstract:
Enterprise Service Bus (ESB) est actuellement l'approche la plus prometteuse pour l'implémentation d'une architecture orientée services (SOA : Service-Oriented Architecture) par l'intégration des différentes applications isolées dans une plateforme centralisée. De nombreuses solutions d'intégration à base d'ESB ont été proposées, elles sont soit open-source comme : Mule, Petals, ou encore Fuse, soit propriétaires tels que : Sonic ESB, IBM WebSphere Message Broker, ou Oracle ESB. Cependant, il n'en existe aucune en mesure de traiter, à la fois, des aspects d'intégration et de traitement massivement parallèle, du moins à notre connaissance. L'intégration du parallélisme dans le traitement est un moyen de tirer profit des technologies multicœurs/multiprocesseurs qui améliorent considérablement les performances des ESB. Toutefois, cette intégration est une démarche complexe et soulève des problèmes à plusieurs niveaux : communication, synchronisation, partage de données, etc. Dans cette thèse, nous présentons l'étude d'une nouvelle architecture massivement parallèle de type ESB.<br>Enterprise service bus (ESB) is currently the most promising approach for business application integration in distributed and heterogeneous environments. It allows deploying a service-oriented architecture (SOA) by the integration of all the isolated applications on a decentralized platform. Several commercial or open-source ESB-based solutions have been proposed. However, to the best of our knowledge, none of these solutions has integrated parallel processing. Integrating parallelism into the processing takes advantage of multicore/multiprocessor technologies and can thus greatly improve ESB performance. However, this integration is difficult to achieve and poses problems at multiple levels (communication, synchronization, etc.). In this study, we present a new massively parallel ESB architecture that meets this challenge.
APA, Harvard, Vancouver, ISO, and other styles
23

Benosman, Ridha Mohammed. "Conception et évaluation de performance d'un Bus applicatif, massivement parallèle et orienté service." Electronic Thesis or Diss., Paris, CNAM, 2013. http://www.theses.fr/2013CNAM0889.

Full text
Abstract:
Enterprise Service Bus (ESB) est actuellement l'approche la plus prometteuse pour l'implémentation d'une architecture orientée services (SOA : Service-Oriented Architecture) par l'intégration des différentes applications isolées dans une plateforme centralisée. De nombreuses solutions d'intégration à base d'ESB ont été proposées, elles sont soit open-source comme : Mule, Petals, ou encore Fuse, soit propriétaires tels que : Sonic ESB, IBM WebSphere Message Broker, ou Oracle ESB. Cependant, il n'en existe aucune en mesure de traiter, à la fois, des aspects d'intégration et de traitement massivement parallèle, du moins à notre connaissance. L'intégration du parallélisme dans le traitement est un moyen de tirer profit des technologies multicœurs/multiprocesseurs qui améliorent considérablement les performances des ESB. Toutefois, cette intégration est une démarche complexe et soulève des problèmes à plusieurs niveaux : communication, synchronisation, partage de données, etc. Dans cette thèse, nous présentons l'étude d'une nouvelle architecture massivement parallèle de type ESB.<br>Enterprise service bus (ESB) is currently the most promising approach for business application integration in distributed and heterogeneous environments. It allows deploying a service-oriented architecture (SOA) by the integration of all the isolated applications on a decentralized platform. Several commercial or open-source ESB-based solutions have been proposed. However, to the best of our knowledge, none of these solutions has integrated parallel processing. Integrating parallelism into the processing takes advantage of multicore/multiprocessor technologies and can thus greatly improve ESB performance. However, this integration is difficult to achieve and poses problems at multiple levels (communication, synchronization, etc.). In this study, we present a new massively parallel ESB architecture that meets this challenge.
APA, Harvard, Vancouver, ISO, and other styles
24

Chen, Yuxin. "Massively Parallel Dimension Independent Adaptive Metropolis." Thesis, 2015. http://hdl.handle.net/10754/552902.

Full text
Abstract:
This work considers black-box Bayesian inference over high-dimensional parameter spaces. The well-known and widely respected adaptive Metropolis (AM) algorithm is extended herein to asymptotically scale uniformly with respect to the underlying parameter dimension, by respecting the variance, for Gaussian targets. The resulting algorithm, referred to as the dimension-independent adaptive Metropolis (DIAM) algorithm, also shows improved performance with respect to adaptive Metropolis on non-Gaussian targets. This algorithm is further improved, and the possibility of probing high-dimensional targets is enabled, via GPU-accelerated numerical libraries and periodically synchronized concurrent chains (justified a posteriori). Asymptotically in dimension, this massively parallel dimension-independent adaptive Metropolis (MPDIAM) GPU implementation exhibits a factor of four improvement versus the CPU-based Intel MKL version alone, which is itself already a factor of three improvement versus the serial version. The scaling to multiple CPUs and GPUs exhibits a form of strong scaling in terms of the time necessary to reach a certain convergence criterion, through a combination of longer time per sample batch (weak scaling) and yet fewer necessary samples to convergence. This is illustrated by efficiently sampling from several Gaussian and non-Gaussian targets for dimension d ≥ 1000.
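For readers unfamiliar with the baseline being extended, a minimal serial adaptive Metropolis sketch in the Haario style is shown below. The 2.38²/d proposal scaling and the small diagonal jitter are textbook choices, not details taken from this thesis, and the target here is a hypothetical standard Gaussian:

```python
import numpy as np

def adaptive_metropolis(log_target, x0, n_steps=5000, adapt_after=500, seed=1):
    """Minimal adaptive Metropolis sketch: the Gaussian proposal covariance
    is estimated from the chain history and scaled by 2.38^2 / d."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    chain = np.empty((n_steps, d))
    x = np.asarray(x0, dtype=float)
    lp = log_target(x)
    cov = 0.1 * np.eye(d)                        # fixed covariance at first
    for t in range(n_steps):
        if t > adapt_after:                      # adapt from past samples
            cov = np.cov(chain[:t].T) * 2.38**2 / d + 1e-6 * np.eye(d)
        prop = rng.multivariate_normal(x, cov)
        lp_prop = log_target(prop)
        if np.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject
            x, lp = prop, lp_prop
        chain[t] = x
    return chain

# Standard Gaussian target in d = 3: the post-burn-in mean should be near 0.
chain = adaptive_metropolis(lambda x: -0.5 * np.dot(x, x), np.zeros(3))
assert np.all(np.abs(chain[1000:].mean(axis=0)) < 0.5)
```

The DIAM/MPDIAM algorithms of the thesis modify how this covariance adaptation scales with dimension and run many such chains concurrently; the sketch only shows the classical starting point.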
APA, Harvard, Vancouver, ISO, and other styles
25

VELLA, FLAVIO. "Graph analytics on modern massively parallel systems." Doctoral thesis, 2017. http://hdl.handle.net/11573/1068940.

Full text
Abstract:
Graphs provide a very flexible abstraction for understanding and modeling complex systems in many fields such as physics, biology, neuroscience, engineering, and social science. Only in the last two decades, with the advent of the Big Data era, have supercomputers equipped with accelerators, i.e., graphics processing units (GPUs), advanced networking, and highly parallel file systems been used to analyze graph properties such as reachability, diameter, connected components, centrality, and clustering coefficient. Today graphs of interest may be composed of millions, sometimes billions, of nodes and edges and exhibit a highly irregular structure. As a consequence, the design of efficient and scalable graph algorithms is an extraordinary challenge due to irregular communication and memory access patterns, high synchronization costs, and lack of data locality. In the present dissertation, we start off with a brief and gentle introduction for the reader to graph analytics and massively parallel systems. In particular, we present the intersection between graph analytics and parallel architectures in the current state of the art and discuss the challenges encountered when solving such problems on large-scale graphs on these architectures (Chapter 1). In Chapter 2, some preliminary definitions and graph-theoretical notions are provided together with a description of the synthetic graphs used in the literature to model real-world networks. In Chapters 3-5, we present and tackle three different relevant problems in graph analysis: reachability (Chapter 3), Betweenness Centrality (Chapter 4), and clustering coefficient (Chapter 5).
In detail, Chapter 3 tackles reachability problems by providing two scalable algorithms and implementations which efficiently solve st-connectivity problems on very large-scale graphs. Chapter 4 considers the problem of identifying the most relevant nodes in a network, which plays a crucial role in several applications, including transportation and communication networks, social network analysis, and biological networks. In particular, we focus on a well-known centrality metric, namely Betweenness Centrality (BC), and present two different distributed algorithms for the BC computation on unweighted and weighted graphs. For unweighted graphs, we present a new communication-efficient algorithm based on the combination of bi-dimensional (2D) decomposition and multi-level parallelism. Furthermore, new algorithms which exploit the underlying graph topology to reduce the time and space usage of betweenness centrality computations are described as well. Concerning weighted graphs, we provide a scalable algorithm based on an algebraic formulation of the problem. Finally, through comprehensive experimental results on synthetic and real-world large-scale graphs, we show that the proposed techniques are effective in practice and achieve significant speedups against state-of-the-art solutions. Chapter 5 considers the clustering coefficient problem. Similarly to Betweenness Centrality, it is a fundamental tool in network analysis, as it specifically measures how nodes tend to cluster together in a network. In the chapter, we first extend caching techniques to Remote Memory Access (RMA) operations on distributed-memory systems. The caching layer is mainly designed to avoid inter-node communications, in order to achieve benefits for irregular applications similar to those of communication-avoiding algorithms. We also show how cached RMA is able to improve the performance of a new distributed asynchronous algorithm for the computation of local clustering coefficients.
Finally, Chapter 6 contains a brief summary of the key contributions described in the dissertation and presents potential future directions of the work.
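As a point of reference for the betweenness centrality chapter, here is a serial sketch of Brandes' algorithm for unweighted graphs, the standard baseline that distributed BC algorithms parallelize (this is an illustration, not the dissertation's code; the graph and adjacency format are hypothetical):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for unweighted graphs: one BFS per source vertex,
    then dependency accumulation in reverse BFS order."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}; sigma[s] = 1    # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:                                     # BFS from s
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:           # w is one step further
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                    # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Path graph 0-1-2: only the middle vertex lies on the 0 <-> 2 shortest path,
# and each unordered pair is counted once per ordering (no halving here).
adj = {0: [1], 1: [0, 2], 2: [1]}
bc = betweenness(adj)
assert bc[1] == 2.0 and bc[0] == 0.0
```

The per-source BFS loops are independent, which is exactly the coarse-grained parallelism that distributed BC implementations exploit before applying finer 2D-decomposition tricks.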
APA, Harvard, Vancouver, ISO, and other styles
26

"Non-manifold solid modeling on a massively parallel computer." Chinese University of Hong Kong, 1994. http://library.cuhk.edu.hk/record=b5888533.

Full text
Abstract:
Kan Yeuk Ming.<br>Thesis (M.Phil.)--Chinese University of Hong Kong, 1994.<br>Chapter 1. --- INTRODUCTION --- p.1<br>Chapter 1.1 --- Motivation --- p.1<br>Chapter 1.2 --- Objectives --- p.2<br>Chapter 1.3 --- Report Organization --- p.3<br>Chapter 2. --- RETROSPECT OF NON-MANIFOLD SOLID MODELING --- p.5<br>Chapter 2.1 --- Geometric Modeling --- p.5<br>Chapter 2.2 --- Euclidean Space and Topological Space --- p.6<br>Chapter 2.3 --- Domains of Solid and Non-Manifold Geometric Modeling --- p.8<br>Chapter 2.3.1 --- r-set Domain --- p.8<br>Chapter 2.3.2 --- Manifold Domain --- p.9<br>Chapter 2.3.3 --- Adjacency Form of Topology --- p.11<br>Chapter 2.3.4 --- Cell Complex --- p.13<br>Chapter 2.4 --- Representation Schemes of Solid and Non-Manifold Geometric Modeling --- p.14<br>Chapter 2.4.1 --- Spatial Decomposition --- p.14<br>Chapter 2.4.2 --- Constructive Solid Geometry (CSG) --- p.15<br>Chapter 2.4.3 --- Boundary Representations (B-rep) --- p.17<br>Chapter 2.5 --- Summary --- p.20<br>Chapter 3. --- BOOSTING UP THE SPEED OF BOOLEAN OPERATIONS --- p.21<br>Chapter 3.1 --- Solid Modeling with Specialized Hardware --- p.22<br>Chapter 3.1.1 --- Modeling with a 4x4 Determinant Processor --- p.22<br>Chapter 3.1.2 --- Ray Casting Engine --- p.24<br>Chapter 3.2 --- Solid Modeling with General Purposed Parallel Computer --- p.25<br>Chapter 3.2.1 --- Modeling with Shared Memory Parallel Computer --- p.27<br>Chapter 3.2.2 --- Modeling with SIMD Massively Parallel Computer --- p.27<br>Chapter 3.2.3 --- Modeling with MIMD Distributed Memory Parallel Computer --- p.30<br>Chapter 3.3 --- Summary --- p.33<br>Chapter 4. 
--- OVERVIEW OF DECmpp 12000/Sx/8K --- p.34<br>Chapter 4.1 --- System Architecture --- p.34<br>Chapter 4.1.1 --- DECmpp Sx Front End --- p.34<br>Chapter 4.1.2 --- DECmpp Sx Data Parallel Unit --- p.35<br>Chapter 4.1.2.1 --- Array Control Unit --- p.35<br>Chapter 4.1.2.2 --- Processor Element Array --- p.35<br>Chapter 4.1.2.3 --- Processor Element Communication Mechanism --- p.36<br>Chapter 4.2 --- DECmpp Sx Programming Language --- p.37<br>Chapter 4.2.1 --- Variable Declarations --- p.37<br>Chapter 4.2.2 --- Plural Pointers --- p.38<br>Chapter 4.2.3 --- Processor Selection by Conditional Expressions --- p.39<br>Chapter 4.2.4 --- Processor Element Communications --- p.39<br>Chapter 4.3 --- Summary --- p.40<br>Chapter 5. --- ARCHITECTURE OF THE NON-MANIFOLD GEOMETRIC MODELER --- p.41<br>Chapter 6. --- SEQUENTIAL MODELER --- p.43<br>Chapter 6.1 --- Sequential Half-Wedge structures (SHW) --- p.43<br>Chapter 6.2 --- Incremental Topological Operators --- p.51<br>Chapter 6.3 --- Sequential Boolean Operations --- p.58<br>Chapter 6.3.1 --- Complementing the subtracted model --- p.59<br>Chapter 6.3.2 --- Computing intersection of geometric entities --- p.59<br>Chapter 6.3.3 --- Construction of sub-faces --- p.53<br>Chapter 6.3.4 --- Extraction of resultant topological entities --- p.64<br>Chapter 6.4 --- Summary --- p.67<br>Chapter 7. 
--- PARALLEL MODELER --- p.68<br>Chapter 7.1 --- Parallel Half-Wedge Structure (PHW) --- p.68<br>Chapter 7.1.1 --- Pmodel structure --- p.69<br>Chapter 7.1.1.1 --- Phwedge structure --- p.69<br>Chapter 7.1.1.2 --- Psurface structure --- p.71<br>Chapter 7.1.1.3 --- Pedge structure --- p.72<br>Chapter 7.1.2 --- Pmav structure --- p.73<br>Chapter 7.2 --- Parallel Boolean Operations --- p.74<br>Chapter 7.2.1 --- Complementing the subtracted model --- p.75<br>Chapter 7.2.2 --- Intersection computation --- p.79<br>Chapter 7.2.2.1 --- Distributing geometric entities --- p.80<br>Chapter 7.2.2.2 --- Vertex-Vertex intersection --- p.89<br>Chapter 7.2.2.3 --- Vertex-Edge intersection --- p.89<br>Chapter 7.2.2.4 --- Edge-Edge intersection --- p.89<br>Chapter 7.2.2.5 --- Vertex-Face intersection --- p.90<br>Chapter 7.2.2.6 --- Edge-Face intersection --- p.92<br>Chapter 7.2.2.7 --- Face-Face intersection --- p.93<br>Chapter 7.2.3 --- Constructing sub-faces --- p.98<br>Chapter 7.2.4 --- Extraction and construction of resultant topological entities --- p.100<br>Chapter 7.3 --- Summary --- p.106<br>Chapter 8. --- THE PERFORMANCE OF PARALLEL HALF-WEDGE MODELER --- p.108<br>Chapter 8.1 --- The performance of converting sequential to parallel structure --- p.111<br>Chapter 8.2 --- The overall performance of parallel Boolean operations --- p.112<br>Chapter 8.3 --- The percentage of execution time for individual stages of parallel Boolean operations --- p.119<br>Chapter 8.4 --- The effect of inbalance loading to the performance of parallel Boolean operations --- p.121<br>Chapter 8.5 --- Summary --- p.125<br>Chapter 9. --- CONCLUSIONS AND SUGGESTIONS FOR FURTHER WORK --- p.126<br>Chapter 9.1 --- Conclusions --- p.126<br>Chapter 9.2 --- Suggestions for further work --- p.127<br>APPENDIX<br>Chapter A. --- SEQUENTIAL HALF-WEDGE STRUCTURE --- p.A-1<br>Chapter B. --- COMPUTATION SCHEME IN CHECKING A FACE LOCATING INSIDE THE FACES OF A SOLID --- p.A-3<br>Chapter C. 
--- ALGORITHM IN FINDING A HALF-WEDGE WITH A DIRECTION CLOSEST FROM A REFERENCE HALF-WEDGE --- p.A-5<br>Chapter D. --- PARALLEL HALF-WEDGE STRUCTURE --- p.A-7<br>REFERENCES --- p.A-10
APA, Harvard, Vancouver, ISO, and other styles
27

Escobar, Ivanauskas Mauricio. "Massively parallel simulator of optical coherence tomography of inhomogeneous media." 2015. http://hdl.handle.net/1993/30371.

Full text
Abstract:
Optical coherence tomography (OCT) imaging is used in an increasing number of biomedical and industrial applications. A massively parallel simulator of OCT of inhomogeneous turbid media, e.g., biological tissue, could be used as a practical tool to expedite and expand the study of the physical phenomena involved in such an imaging technique, as well as to design OCT systems with enhanced performance. Our work presents the open-source implementation of this massively parallel simulator of OCT to satisfy the ever-increasing need for prompt computation of OCT signals with accuracy and flexibility. Our Monte Carlo-based simulator uses graphics processing units (GPUs) to accelerate the intensive computation of processing tens of millions of photon packets undergoing a random walk through a sample. It provides computation of both Class I diffusive reflectance due to ballistic and quasi-ballistic scattered photons and Class II diffusive reflectance due to multiply scattered photons. Our implementation was tested by comparing results with previously validated OCT simulators in multilayered and inhomogeneous (arbitrary spatial distributions) turbid media configurations. It models the objects as a tetrahedron-based mesh and implements an advanced importance sampling technique. Our massively parallel simulator of OCT speeds up the simulation of OCT signals by a factor of 40 when compared to its central processing unit (CPU)-based sequential implementation.
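The photon-packet random walk at the heart of such Monte Carlo simulators can be illustrated with a deliberately simplified 1D toy: isotropic scattering, no tetrahedral mesh, no Class I/II separation, and hypothetical optical coefficients. Each packet takes exponentially distributed free paths and deposits a fraction of its weight at every interaction:

```python
import numpy as np

def simulate_photons(mu_a, mu_s, n_photons=2000, seed=7):
    """Toy 1D Monte Carlo photon-packet sketch: returns the mean fraction of
    launched weight absorbed inside a semi-infinite medium (z >= 0)."""
    rng = np.random.default_rng(seed)
    mu_t = mu_a + mu_s            # total interaction coefficient
    albedo = mu_s / mu_t          # surviving weight fraction per interaction
    absorbed = 0.0
    for _ in range(n_photons):
        z, direction, w = 0.0, 1.0, 1.0   # launch straight into the medium
        while w > 1e-4:                   # roulette-free weight cutoff
            z += direction * rng.exponential(1.0 / mu_t)  # free path length
            if z < 0:                     # escaped back out of the surface
                break
            absorbed += w * (1.0 - albedo)                # deposit weight
            w *= albedo
            direction = rng.choice([-1.0, 1.0])           # "isotropic" in 1D
    return absorbed / n_photons

frac = simulate_photons(mu_a=0.1, mu_s=1.0)
assert 0.0 < frac < 1.0   # a physical fraction of the launched weight
```

Because every photon packet is independent, mapping this loop onto one GPU thread per packet is natural, which is why such simulators parallelize so well.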
APA, Harvard, Vancouver, ISO, and other styles
28

Asgari, Kamiabad Amirhassan. "Implementing a Preconditioned Iterative Linear Solver Using Massively Parallel Graphics Processing Units." Thesis, 2011. http://hdl.handle.net/1807/27321.

Full text
Abstract:
The research conducted in this thesis provides a robust implementation of a preconditioned iterative linear solver on programmable graphics processing units (GPUs). Solving a large, sparse linear system is the most computationally demanding part of many widely used power system analyses. This thesis presents a detailed study of iterative linear solvers with a focus on Krylov-based methods. Since the ill-conditioned nature of power system matrices typically requires substantial preconditioning to ensure the robustness of Krylov-based methods, a polynomial preconditioning technique is also studied in this thesis. The implementation of the Chebyshev polynomial preconditioner and biconjugate gradient solver on a programmable GPU is presented and discussed in detail. Evaluation of the performance of the GPU-based preconditioner and linear solver on a variety of sparse matrices shows significant computational savings relative to a CPU-based implementation of the same preconditioner and commonly used direct methods.
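To illustrate the idea of polynomial preconditioning inside a Krylov solver, here is a small NumPy sketch using a truncated Neumann-series polynomial with conjugate gradient. This is a simpler stand-in for the Chebyshev polynomial preconditioner and BiCG solver the thesis actually implements on the GPU, and the test matrix is a hypothetical SPD tridiagonal system:

```python
import numpy as np

def poly_precond(A, r, k=4):
    """Apply M^-1 r = sum_{i=0}^{k} (I - D^-1 A)^i D^-1 r with D = diag(A).
    A truncated Neumann-series polynomial: only matrix-vector products are
    needed, which is what makes polynomial preconditioners GPU-friendly."""
    d = np.diag(A)
    term = r / d
    z = term.copy()
    for _ in range(k):
        term = term - (A @ term) / d   # apply (I - D^-1 A) to the last term
        z = z + term
    return z

def pcg(A, b, k=4, tol=1e-10, maxit=200):
    """Preconditioned conjugate gradient with the polynomial preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = poly_precond(A, r, k)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = poly_precond(A, r, k)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# SPD, diagonally dominant test system (constant diagonal keeps M symmetric).
n = 50
A = (4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
b = np.ones(n)
x = pcg(A, b)
assert np.linalg.norm(A @ x - b) < 1e-8
```

CG is used here instead of BiCG purely because the toy matrix is symmetric; power system matrices are generally not, which is why the thesis pairs the polynomial preconditioner with the biconjugate gradient method.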
APA, Harvard, Vancouver, ISO, and other styles
29

Wu, Jiann-Shing, and 吳建興. "Automatic Loops Restructuring Techniques for Parallelizing Compilers on Massively Parallel Processing Systems." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/42910995400652305984.

Full text
Abstract:
碩士<br>國立成功大學<br>資訊及電子工程研究所<br>83<br>Massively parallel machines provide supercomputing capability for engineering and scientific applications. They offer significant advantages in cost/performance ratio and scalability. However, they are much more difficult to program than shared-memory machines: programmers must distribute data over processors and manage interprocessor communication. Therefore, it is important to develop a parallelizing compiler to resolve this difficulty. In this thesis, we studied and developed a loop restructuring tool for parallelizing compilers on massively parallel processing systems. This tool transforms C programs into semantics-preserving MPL parallel programs so that programmers do not need to provide any data partition specification such as data distribution and alignment. The prototype of this loop restructuring system can be developed as the basis of parallelizing compilers.
APA, Harvard, Vancouver, ISO, and other styles
30

Yen, Chung-Liang, and 顏仲良. "Design of the Router and Its Routing Algorithms for Massively Parallel Processing Systems." Thesis, 1998. http://ndltd.ncl.edu.tw/handle/59960787674413368773.

Full text
Abstract:
碩士<br>逢甲大學<br>資訊工程學系<br>86<br>In this thesis, we propose two routing algorithms for massively parallel processing systems. The first algorithm, named the HAR algorithm, uses a heuristic and adaptive approach to balance communication in the network. Because more routing resources are available on the network, the total system performance is improved. The second algorithm, named the RNN algorithm, is a wormhole routing algorithm that uses a redistribution node approach. The RNN algorithm uses two virtual channels to prevent deadlock on torus networks. The performance of the two algorithms is evaluated and simulated in software. The results show that our algorithms are more adaptive than the other three algorithms in the literature. Finally, we give the complete design and implementation of the wormhole router for the HAR algorithm. The detailed router architecture, design method, and circuits are presented.
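For context, the deterministic baseline that adaptive routers such as HAR are usually measured against is dimension-order routing. A sketch on a 2D torus, resolving one dimension at a time and taking the shorter wraparound direction, might read as follows (coordinates and API are hypothetical, not taken from the thesis):

```python
def torus_route(src, dst, dims):
    """Dimension-order routing sketch on a torus: resolve each dimension in a
    fixed order, stepping in the shorter wraparound direction. Deterministic
    routers like this are deadlock-prone on tori without virtual channels,
    which is the problem the RNN algorithm's two virtual channels address."""
    path = [tuple(src)]
    cur = list(src)
    for i, size in enumerate(dims):
        while cur[i] != dst[i]:
            fwd = (dst[i] - cur[i]) % size          # hops going "up" with wrap
            step = 1 if fwd <= size - fwd else -1   # pick the shorter way
            cur[i] = (cur[i] + step) % size
            path.append(tuple(cur))
    return path

# On a 4x4 torus, (0,0) -> (3,0) is a single wraparound hop in dimension 0,
# and (0,0) -> (1,1) takes one hop per dimension.
assert torus_route((0, 0), (3, 0), (4, 4)) == [(0, 0), (3, 0)]
assert len(torus_route((0, 0), (1, 1), (4, 4))) == 3
```

An adaptive router would instead choose among several minimal next hops based on local congestion, which is the balancing behaviour the HAR algorithm's heuristics provide.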
APA, Harvard, Vancouver, ISO, and other styles
31

Yan, Zhong-Liang, and 顏仲良. "Design of the Router and Its Routing Algorithms for the Massively Parallel Processing Systems." Thesis, 1998. http://ndltd.ncl.edu.tw/handle/38751885705129549480.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

徐卿明. "The Implementation Of A Group Communication Mechanism For The Operation System Of A Massively Parallel Processing Computer." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/36083840345665125377.

Full text
Abstract:
碩士<br>逢甲大學<br>資訊工程研究所<br>85<br>A Massively Parallel Processing (MPP) computer system is a computer system supporting large computation capacity. In a distributed-memory MPP computer system, the communication among nodes is through message passing. Therefore, the message-passing InterProcess Communication (IPC) mechanism is a key issue for the performance of MPP computer systems. In addition to the point-to-point communication type, group communication, including the broadcast and multicast communication types, is also desired in the programs on an MPP computer system. Group communication can conveniently and efficiently support the communication among multiple processes. This thesis researches the group IPC mechanism for an MPP computer system and implements this mechanism under a distributed environment.   The design issues for a group IPC mechanism include two aspects: the system architecture of group IPC and the group communication mechanism. The architecture aspect considers how to create and manage the groups and group members, and the organization of the group IPC management system. The communication aspect considers how to transmit messages and the transmission quality of messages. The proposed group IPC communication system has the following features: (1) It is easy to maintain and manage the information about groups and group members. (2) It is location transparent. (3) By using the communication port technique, the problem of group overlap is solved. (4) By using a hierarchical point-to-point communication structure, better communication performance can be obtained. (5) A kernel-level library interface for U-Port, B-Port, and M-Port is provided to both system programs and applications.   We have successfully implemented an MPP communication server, called mppnetmsgserver, on the Mach microkernel.
The mppnetmsgserver extends Mach IPC to network capability and provides broadcast and multicast communications. This server will be useful for programmers of both parallel processing and distributed computing.
APA, Harvard, Vancouver, ISO, and other styles
33

Feng, Zhuo. "Modeling and Analysis of Large-Scale On-Chip Interconnects." 2009. http://hdl.handle.net/1969.1/ETD-TAMU-2009-12-7142.

Full text
Abstract:
As IC technologies scale to the nanometer regime, efficient and accurate modeling and analysis of VLSI systems with billions of transistors and interconnects becomes increasingly critical and difficult. VLSI systems impacted by increasingly high-dimensional process-voltage-temperature (PVT) variations demand much more modeling and analysis effort than ever before, while the analysis of large-scale on-chip interconnects that requires solving tens of millions of unknowns imposes great challenges in computer-aided design areas. This dissertation presents new methodologies for addressing these two important challenges in large-scale on-chip interconnect modeling and analysis. In the past, standard statistical circuit modeling techniques usually employed principal component analysis (PCA) and its variants to reduce the parameter dimensionality. Although widely adopted, these techniques can be very limited, since parameter dimension reduction is achieved by merely considering the statistical distributions of the controlling parameters while neglecting the important correspondence between these parameters and the circuit performances (responses) under modeling. This dissertation presents a variety of performance-oriented parameter dimension reduction methods that can lead to more than one order of magnitude parameter reduction for a variety of VLSI circuit modeling and analysis problems. The sheer size of present-day power/ground distribution networks makes their analysis and verification tasks extremely runtime- and memory-inefficient, and at the same time limits the extent to which these networks can be optimized. Given today's commodity graphics processing units (GPUs) that can deliver more than 500 GFlops (Flops: floating point operations per second) of computing power and 100 GB/s of memory bandwidth, more than 10X greater than those offered by modern-day general-purpose quad-core microprocessors, it is very desirable to convert this impressive GPU computing power into usable design automation tools for VLSI verification. In this dissertation, for the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with very promising performance. Our GPU-based network analyzer is capable of solving tens of millions of power grid nodes in just a few seconds. Additionally, with the above GPU-based simulation framework, more challenging three-dimensional full-chip thermal analysis can be solved in a much more efficient way than ever before.
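One reason power grid analysis maps well onto SIMT hardware is that stationary iterative updates are independent per node. A tiny CPU-side NumPy sketch of a Jacobi solve for a nodal equation G v = i (illustrative only, with a hypothetical three-node conductance matrix, not the dissertation's solver):

```python
import numpy as np

def jacobi_power_grid(G, i_inj, iters=500):
    """Jacobi iteration for G v = i: every node's voltage update depends only
    on the previous iterate, so all updates can run in parallel. The numpy
    vectorized step below mirrors a one-thread-per-node GPU kernel."""
    d = np.diag(G)                    # diagonal (self) conductances
    off = G - np.diag(d)              # off-diagonal (coupling) conductances
    v = np.zeros_like(i_inj)
    for _ in range(iters):
        v = (i_inj - off @ v) / d     # all nodes updated "simultaneously"
    return v

# Small diagonally dominant conductance matrix (guarantees convergence).
G = np.array([[3.0, -1.0, -1.0],
              [-1.0, 3.0, -1.0],
              [-1.0, -1.0, 3.0]])
i_inj = np.array([1.0, 0.0, 0.0])
v = jacobi_power_grid(G, i_inj)
assert np.linalg.norm(G @ v - i_inj) < 1e-6
```

Real power-grid solvers use far more sophisticated iterations and preconditioning, but the per-node independence shown here is the structural property that GPU implementations exploit.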
APA, Harvard, Vancouver, ISO, and other styles
We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!

To the bibliography