Dissertations / Theses: 'CELL Broadband Engine'

1

Ålind, Markus. "A Skeleton library for Cell Broadband Engine." Thesis, Linköping University, Department of Computer and Information Science, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-54476.

Abstract:

The Cell Broadband Engine processor is a powerful processor capable of over 220 GFLOPS. It is highly specialized and can be controlled in detail by the programmer. The Cell is significantly more complicated to program than a standard homogeneous multi-core processor such as the Intel Core2 Duo and Quad. This thesis explores the possibility of abstracting some of the complexities of Cell programming while maintaining high performance. The abstraction is achieved through a library of parallel skeletons implemented in the bulk synchronous parallel programming environment NestStep. The library includes constructs for user-defined, SIMD-optimized data-parallel skeletons such as map, reduce and more. The evaluation of the library includes porting a vector-based scientific computation program from sequential C code to the Cell using the library and the NestStep environment. The ported program shows good performance when compared to the sequential original code run on a high-end x86 processor. The evaluation also shows that a dot product implemented with the skeleton library is faster than the dot product in the IBM BLAS library for the Cell processor when more than two slave processors are used.
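As a rough illustration of the skeleton idea mentioned in this abstract, a dot product can be written as a map skeleton (element-wise multiply) followed by a reduce skeleton (sum). The sketch below is plain sequential Python; the names `skel_map`, `skel_reduce` and `dot` are hypothetical and are not the thesis library's API, and nothing here models SIMD or NestStep.

```python
from functools import reduce
import operator


def skel_map(fn, *vectors):
    """Map skeleton (hypothetical name): apply fn element-wise across vectors."""
    return [fn(*xs) for xs in zip(*vectors)]


def skel_reduce(fn, vector, initial):
    """Reduce skeleton (hypothetical name): fold a vector into one value."""
    return reduce(fn, vector, initial)


def dot(x, y):
    """Dot product composed from the two skeletons above."""
    products = skel_map(operator.mul, x, y)
    return skel_reduce(operator.add, products, 0.0)
```

In a skeleton library the user supplies only the element-wise function and the combining function; how the work is split across cores stays inside the skeleton.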

2

Lundberg, Marcus. "A Parallel Monte Carlo Implementation on the Cell Broadband Engine." Thesis, Uppsala University, Department of Information Technology, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-108035.

Abstract:

The Cell Broadband Engine is a heterogeneous multi-core processor architecture that trades ease-of-programming for high performance. While primarily featured in the Sony PlayStation 3 (PS3) for high-end games, it is a promising technology for scientists working with computationally heavy numerical methods. This paper presents three implementations of a Monte Carlo simulation of a system of charged particles on the PS3. The first method, while easy to implement and use, did not yield any performance advantage over conventional x86 processors. The second method ran more than twice as fast on the PS3 as a comparable code on a 1.86 GHz Intel Xeon machine but could run only a limited problem size. The third program ran over six times faster than the x86 reference system and could handle any problem up to the saturation of the PS3 main memory. The final program is also suitable for a cluster of PlayStations and is easily adaptable to work on a distributed computing framework.

3

Rajamohan, Srijith Datta Suman Narayanan Vijaykrishnan. "A neural network based classifier on the cell broadband engine." [University Park, Pa.] : Pennsylvania State University, 2009. http://etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-4512/index.html.

4

Lopes, André Filipe da Rocha. "tlCell: a software transactional memory for the cell broadband engine architecture." Master's thesis, Faculdade de Ciências e Tecnologia, 2010. http://hdl.handle.net/10362/4110.

Abstract:

Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa to obtain the degree of Master in Informatics Engineering (Engenharia Informática)
Computers have evolved exponentially over the last decade. Performance has been the main goal, leading to ever higher processor frequencies, an approach that is no longer feasible because of the excessive power consumption of current processors. The Cell Broadband Engine architecture began with the goal of providing high computational capacity at low power consumption. The result is an architecture with heterogeneous multiprocessors and a unique memory distribution, aimed at high performance and at reducing hardware complexity in order to lower production cost. Concurrency and parallelism techniques are expected to increase the performance of this architecture; however, the high-performance solutions presented so far are always very specific, and because of its innovative architecture and memory distribution it is still difficult to provide tools able to exploit concurrency and parallelism as an abstraction layer. Software Transactional Memory is a programming model that proposes this level of abstraction and has been gaining popularity, with various implementations already achieving performance close to that of specific fine-grained solutions. The possibility of using Software Transactional Memory on this innovative architecture, by developing a tool able to abstract the programmer from memory consistency and management, is appealing. This document specifies a deferred-update Software Transactional Memory platform for the Cell Broadband Engine architecture that takes advantage of the computational capacity of the Synergistic Processing Elements (SPEs) using commit-time locks. Two different models are proposed, fully local and multi-buffered, in order to study the implications of the choices made in the platform's design.
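To make the terms "deferred update" and "commit-time locks" in this abstract concrete, here is a minimal sketch in Python. It is not tlCell and does not model SPEs, local stores or DMA; the classes `TVar`, `Transaction`, `Conflict` and the helper `atomically` are hypothetical names chosen for illustration. Writes are buffered privately and locks are taken only when the transaction commits.

```python
import threading


class TVar:
    """A transactional variable: value, version counter and a lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()


class Conflict(Exception):
    """Raised when a variable read by the transaction changed before commit."""


class Transaction:
    def __init__(self):
        self.read_versions = {}  # TVar -> version observed at first read
        self.write_buffer = {}   # TVar -> pending value (the deferred update)

    def read(self, tvar):
        if tvar in self.write_buffer:      # read-your-own-writes
            return self.write_buffer[tvar]
        with tvar.lock:                    # consistent (version, value) pair
            version, value = tvar.version, tvar.value
        self.read_versions.setdefault(tvar, version)
        return value

    def write(self, tvar, value):
        self.write_buffer[tvar] = value    # shared memory is untouched for now

    def commit(self):
        # Commit-time locking: lock everything touched, in a fixed order
        # (by id) so that two committing transactions cannot deadlock.
        touched = sorted(set(self.read_versions) | set(self.write_buffer), key=id)
        for tvar in touched:
            tvar.lock.acquire()
        try:
            for tvar, seen in self.read_versions.items():
                if tvar.version != seen:
                    raise Conflict()
            for tvar, value in self.write_buffer.items():
                tvar.value = value
                tvar.version += 1
        finally:
            for tvar in touched:
                tvar.lock.release()


def atomically(fn):
    """Run fn(tx) and retry it until its transaction commits."""
    while True:
        tx = Transaction()
        result = fn(tx)
        try:
            tx.commit()
            return result
        except Conflict:
            continue
```

Locking the read set as well as the write set at commit is a simplification made here to keep the sketch short; real designs usually validate reads more cheaply.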

5

Azuelos, Nathaniel. "An integrated functional solution for multi-core programming on the cell broadband engine." Thesis, McGill University, 2009. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=32276.

Abstract:

Recent efforts in microprocessor development tend toward the coexistence of several Central Processing Units (CPUs) on a single chip. The Cell Broadband Engine (CBE), the fruit of collaboration between Sony, Toshiba and IBM, integrates IBM's legacy PowerPC CPU with a new set of simple cores, all of which communicate through a high speed bus. The multiple cores on the CBE allow users to exploit the parallel nature of their programs. However, it is often difficult to efficiently extract the parallelism from an application and to distribute tasks in a suitable fashion. We propose a dataflow approach to CBE computing where the compiler is in charge of task partitioning and of the infrastructure for runtime distribution of tasks. In this work, we present the NCC programming language, Squid compiler and runtime environment. NCC is a strict functional dataflow language that forces explicit variable dependencies, in order to exploit parallelism in the application. NCC code is thus written by the user without specifying parallelism explicitly. The Squid Compiler draws a virtual data flow graph from the NCC source. This graph is then partitioned according to implementation-specific criteria into tasks and supertasks. The individual tasks are then translated to ANSI-C, and supertasks are analyzed and transformed into scheduling structures. All tasks are executed by the CBE's simple cores. The Squid Runtime Environment (SRE) interacts with the generated scheduler to order tasks' execution, running the supertasks' scheduling, and managing garbage collection. The SRE runs on the CBE's PowerPC core as a separate thread to implement a host-device paradigm, and as resident code on the s

6

Aji, Ashwin Mandayam. "Exploiting Multigrain Parallelism in Pairwise Sequence Search on Emergent CMP Architectures." Thesis, Virginia Tech, 2008. http://hdl.handle.net/10919/33606.

Abstract:

With the emerging hybrid multi-core and many-core compute platforms delivering unprecedented high performance within a single chip, and making rapid strides toward the commodity processor market, they are widely expected to replace the multi-core processors in the existing High-Performance Computing (HPC) infrastructures, such as large scale clusters, grids and supercomputers. On the other hand, in the realm of bioinformatics, the size of genomic databases is doubling every 12 months, and hence the need for novel approaches to parallelize sequence search algorithms has become increasingly important. This thesis takes a significant step forward in bridging the gap between software and hardware by presenting an efficient and scalable model to accelerate one of the popular sequence alignment algorithms by exploiting multigrain parallelism that is exposed by the emerging multiprocessor architectures. Specifically, we parallelize a dynamic programming algorithm called Smith-Waterman both within and across multiple Cell Broadband Engines and within an nVIDIA GeForce General Purpose Graphics Processing Unit (GPGPU). Cell Broadband Engine: We parallelize the Smith-Waterman algorithm within a Cell node by performing a blocked data decomposition of the dynamic programming matrix followed by pipelined execution of the blocks across the synergistic processing elements (SPEs) of the Cell. We also introduce novel optimization methods that completely utilize the vector processing power of the SPE. As a result, we achieve near-linear scalability or near-constant efficiency for up to 16 SPEs on the dual-Cell QS20 blades, and our design is highly scalable to more cores, if available. We further extend this design to accelerate the Smith-Waterman algorithm across nodes on both the IBM QS20 and the PlayStation3 Cell cluster platforms and achieve a maximum speedup of 44, when compared to the execution times on a single Cell node.
We then introduce an analytical model to accurately estimate the execution times of parallel sequence alignments and wavefront algorithms in general on the Cell cluster platforms. Lastly, we contribute and evaluate TOSS -- a Throughput-Oriented Sequence Scheduler, which leverages the performance prediction model and dynamically partitions the available processing elements to simultaneously align multiple sequences. This scheme succeeds in aligning more sequences per unit time with an improvement of 33.5% over the naive first-come, first-serve (FCFS) scheduler. nVIDIA GPGPU: We parallelize the Smith-Waterman algorithm on the GPGPU by optimizing the code in stages, which include optimal data layout strategies, coalesced memory accesses and blocked data decomposition techniques. Results show that our methods provide a maximum speedup of 3.6 on the nVIDIA GPGPU when compared to the performance of the naive implementation of Smith-Waterman.
Master of Science
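The wavefront structure this abstract refers to can be illustrated with a small sketch: in the Smith-Waterman matrix each cell depends only on its upper, left and upper-left neighbours, so all cells on one anti-diagonal are independent of each other. The Python below fills the matrix one anti-diagonal at a time, sequentially. It is not the blocked, pipelined SPE implementation of the thesis, and the linear gap penalty and the scoring values are assumptions made for the example.

```python
def smith_waterman_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score, computed one anti-diagonal at a time.

    Scoring parameters and the linear gap model are illustrative assumptions.
    """
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    # Cells (i, j) with i + j == d form one anti-diagonal (a "wavefront").
    for d in range(2, m + n + 1):
        # Every cell in this inner loop could be computed in parallel,
        # because it reads only values from anti-diagonals d-1 and d-2.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            score = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,
                h[i - 1][j - 1] + score,  # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,        # gap in b
                h[i][j - 1] + gap,        # gap in a
            )
            best = max(best, h[i][j])
    return best
```

A blocked decomposition applies the same dependency pattern to tiles of the matrix instead of single cells, which is what makes pipelining across cores possible.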

7

Cox, Guilherme Mota Cavalcanti de Albuquerque. "Implementação de Visualização de Dados Tridimensionais de Malhas Irregulares no Processador Cell Broadband Engine." Universidade do Estado do Rio de Janeiro, 2009. http://www.bdtd.uerj.br/tde_busca/arquivo.php?codArquivo=8269.

Abstract:

Direct volume rendering has become a popular technique for visualizing volumetric data from sources such as scientific simulations, analytic functions, and medical scanners, among others. Volume rendering algorithms, such as raycasting, can produce high-quality images; however, the use of raycasting has been limited due to its high demands on computational power and memory bandwidth. In this paper, we propose a new implementation of the raycasting algorithm that takes advantage of the highly parallel architecture of the Cell Broadband Engine processor, with 9 heterogeneous cores, in order to allow interactive raycasting of irregular datasets. All the computational power of the Cell BE processor, though, comes at the cost of a different programming model. Applications need to be rewritten in order to exploit the full potential of the Cell processor, which requires using multithreading and vectorized code. In our approach, we tackle this problem by distributing ray computations using the visible faces, and vectorizing the lighting integral operations inside each core. Our experimental results show that we can obtain good speedups, reducing the overall rendering time significantly.

8

Li, Yi-Hsien. "Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor." Thesis, Linköping University, Department of Electrical Engineering, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-8933.

Abstract:

Space-Time Adaptive Processing (STAP) has been widely used in modern radar systems such as Ground Moving Target Indication (GMTI) systems in order to suppress jamming and interference. However, the high performance comes at the price of higher computational complexity, which requires extensive, powerful hardware.

The new STI Cell Broadband Engine (CBE) processor combines a PowerPC core with eight streamlined high-performance SIMD processing engines and offers an opportunity to implement the STAP baseband signal processing without any full custom hardware. This paper presents the implementation of a STAP baseband signal processing flow on the state-of-the-art STI CELL multiprocessor, which enables the concept of Software-Defined Radar (SDR). The potential of the Cell BE processor is studied so that kernel subroutines of STAP such as QR decomposition, Fast Fourier Transform (FFT), and FIR filtering are mapped to the SPE co-processors of the Cell BE processor with a variety of architecture-specific optimization techniques.

This report starts with an overview of airborne radar techniques, and then the standard STAP, specifically the third-order Doppler-factored STAP, is introduced. Next comes a thorough description of the Cell BE architecture, its programming tool chain and parallel programming methods for the Cell BE. A later chapter discusses how STAP is implemented on the Cell BE processor and presents the simulation results. Furthermore, based on the results of earlier benchmarking, an optimized task partitioning and scheduling method is proposed to improve the overall performance.

9

Schmuland, Todd E. "Exploiting Parallel Processing Techniques for Implementation of Wideband MUSIC Algorithm on the IBM Cell Broadband Engine Processor." University of Toledo / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1271273869.

10

Jakobsson, Teodor. "Parallelization of Animation Blending on the PlayStation®3." Thesis, Linköpings universitet, Informationskodning, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-79409.

Abstract:

An animation system gives a dynamic and life-like feel to character motions, allowing motion behaviour that far transcends the mere spatial translations of classic computer games. This increase in behavioural complexity, however, does not come for free, as animation systems are often haunted by considerable performance overhead, the extent of which reflects the complexity of the desired system. In game development performance optimization is key, the pursuit of which is aided by the static hardware configuration of modern gaming consoles. These allow extensive optimization through specializing the application, in whole or in part, to the underlying hardware architecture. In this master's thesis a method that efficiently utilizes the parallel architecture of the PlayStation®3 is proposed in order to migrate the process of animation evaluation and blending from a single-thread implementation on the main processor to a fully parallelized multi-thread solution on the associated coprocessors. This method is further complemented with an in-depth study of the underlying theoretical foundations, as well as a reflection on similar works and approaches as used by other contemporary game development companies.

11

Zhang, Zikai. "Hardware acceleration on IBM cell broadband engine for simulation of coupled interconnects using waveform relaxation and transverse partitioning." Thesis, McGill University, 2009. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=32420.

Abstract:

Over the past few years, the trend in microprocessor design has shifted from increasing the clock frequencies to multi-core designs that embed multiple processing cores on the same chip. This has meant that we can no longer rely on increasing clock frequencies in order to improve the performance of electronic design automation (EDA) tools. In fact, for these tools to take advantage of modern advances in microprocessor design they must be adapted to take advantage of parallel computing architectures. In this thesis we parallelize and implement an algorithm on the IBM Cell Broadband Engine (Cell BE), which is based on the techniques of waveform relaxation and transverse partitioning to efficiently simulate large coupled interconnect circuits at high speed. Several strategies are used in the Cell BE programs to achieve high performance. The Cell BE processor achieves the best performance with a speed-up of 10x when the number of transmission lines is a multiple of the maximum number of Synergistic Processor Elements (SPEs) that are running concurrently.
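One plausible reading of the final observation in this abstract is simple load-balance arithmetic. The sketch below assumes, purely for illustration, that each transmission line is an equal-cost task, that one line runs per SPE at a time, and that lines are processed in rounds; none of this is stated in the abstract, and `spe_utilization` is a hypothetical helper.

```python
import math


def spe_utilization(n_lines, n_spes):
    """Fraction of SPE time slots doing useful work (illustrative model only)."""
    rounds = math.ceil(n_lines / n_spes)   # last round may be partly idle
    return n_lines / (rounds * n_spes)
```

Under these assumptions 16 lines on 8 SPEs fill both rounds completely, while 17 lines need a third round in which 7 of the 8 SPEs sit idle.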

12

Paiva, Pedro Emanuel Pinto de. "Utilização do processador Cell para o processamento de dados obtidos por tomografia aplicada a materiais compósitos." Master's thesis, Faculdade de Ciências e Tecnologia, 2011. http://hdl.handle.net/10362/6095.

Abstract:

Master's dissertation in Informatics Engineering (Engenharia Informática)
Composite materials, in which particles (reinforcements) are dispersed in a base (matrix), are widely used in areas such as aeronautics. When materials engineers test new ways of manufacturing these materials, they use data obtained from X-ray tomographs to characterize the population of reinforcements. The data generated by the tomographs demand large processing capacity, not only because of their volume (on the order of 1 GByte) but also because of the computational complexity of some of the algorithms. The execution times of some phases of tomographic data processing can be reduced by parallelizing the corresponding algorithms. In earlier work, distributed-memory and shared-memory multiprocessors were used as the execution platform for these versions of the algorithms. The Cell Broadband Engine (Cell BE) is a heterogeneous multiprocessor designed to offer high processing capacity with better energy efficiency than conventional CPUs. These characteristics make the Cell BE widely used in the development of programs for computational science and engineering. In this thesis, versions of some tomographic data processing operations are developed specifically for the Cell BE. The Cell BE is a heterogeneous multiprocessor in which a conventional processor (PPU), 8 processors specialized in number crunching (SPUs) and an interconnect bus coexist on the same chip. Some authors call the Cell BE a "cluster on a chip" to stress that there is a set of address spaces, which forces the programmer or the runtime environment to manage explicitly the data transfers between the various parts of memory. This organization suggests that, to build parallel versions of the processing algorithms, one should consider geometric parallelization strategies similar to those used on a cluster of conventional machines.
Experience showed that the scarce local memory available on the SPUs means this strategy has to be complemented by others. Despite these limitations, the thesis shows that significant reductions in the execution times of some tomographic data processing algorithms can be achieved on the Cell BE, even relative to earlier work in which conventional multiprocessors were used.

13

SHI, YU. "Enhanced SAR Image Processing Using A Heterogeneous Multiprocessor." Thesis, Linköping University, Department of Computer and Information Science, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-11517.

Abstract:

Synthetic aperture radar (SAR) is a pulse-focusing airborne radar technique that can produce high-resolution radar images. A number of image processing algorithms have been developed for this kind of radar, but the computational burden is still heavy, so SAR image processing is normally performed “off-line”.

The Fast Factorized Back Projection (FFBP) algorithm is considered a computationally efficient algorithm for image formation in SAR, and several applications have been implemented which try to make the process “on-line”.

The CELL Broadband Engine is one of the newest multi-core processors, jointly developed by Sony, Toshiba and IBM. CELL is good at parallel computation and floating-point arithmetic, both of which fit the demands of SAR image formation.

This thesis implements the FFBP algorithm on the CELL Broadband Engine and compares the results with earlier projects. The aim of the project is to make it possible to perform SAR image formation in real time.

14

"A Tiger Compiler for the Cell Broadband Engine Architecture." Thesis, 2013. http://hdl.handle.net/10388/ETD-2013-08-1238.

Abstract:

The modern computing industry tends to build integrated circuits with multiple energy-efficient cores instead of ramping up the clock speed for each single processing unit. While each core may not run as fast as the single core model, such architecture allows more jobs to be handled in parallel and also provides better overall performance. Asymmetric Multiprocessing, also known as Heterogeneous Multiprocessing, involves multiple processors that differ architecturally from one another, especially where each processor has its own memory space. Under power limitations, this design could provide better performance than that attained through symmetric multiprocessing. However, the heterogeneous nature adds difficulty to programming. Each specific architecture requires its own program code. Programmers also need to explicitly transfer code and data between processors. This study describes the implementation of a compiler of the pedagogic Tiger language for the Cell Broadband Engine, an asymmetric multiprocessing platform jointly developed by Sony, Toshiba and IBM. The problem above is solved by introducing multiple backends for the Tiger language, along with a remote call stub (RCS) generator. Functions are compiled into different architectures, and calls across architectures are linked automatically through the stubs. RCS takes care of the execution context switch and hides details of the argument data/return value transfer. TigC simplifies the programming and building procedures. It also provides a high-level view of the whole program execution for future optimization because all of the source files are processed by a single compiler. As an example of this procedure, the possible optimization of data transfer during remote calls is investigated here.

15

Johnson, Jacob Raghavan Padma. "Power efficiency and scaling of the cell broadband engine." 2009. http://etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-3966/index.html.

16

Shaffer, Andrew P. Raghavan Padma. "Pfftc: an improved fast Fourier transform for the IBM Cell Broadband Engine." 2009. http://etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-4024/index.html.

17

Chien, Jung-Yin, and 簡榮胤. "A Development Environment of Dataflow Programming Model with Application to IBM Cell Broadband Engine." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/56951943052958810030.

Abstract:

Master's thesis
National Cheng Kung University
Department of Computer Science and Information Engineering (Master's and Doctoral Program)
Academic year 97 (ROC calendar)
Multicore processors provide large computational capability but also involve complicated parallel programming. One of the major considerations in parallel programming is performance. Traditional design methodologies, which start a design on a selected platform, usually spend a lot of effort and time on performance tuning and debugging. When the platform is changed, the entire design flow may have to be repeated, which is very time-consuming. Hence a flexible design methodology is necessary. In this thesis, we present a dataflow design methodology and use it in programming the Cell processor. The dataflow model provides a high-level abstraction of the underlying hardware. Computation and communication of the target application are separated and represented as modules and channels, respectively. To demonstrate the proposed programming model, an MPEG-4 SP decoder is used as an example. The parallelisms of the MPEG-4 decoder are discussed and exposed with the dataflow model. To map the high-level dataflow model to the Cell processor, a mapping flow, including offline profiling, task allocation and runtime libraries, is developed. Based on the profiled data, the allocation algorithm allocates tasks to the processors as evenly as possible. An efficient synchronization mechanism on the Cell processor is also proposed. We also discuss the impact of the models and the mapping flow on decoding speed. The results show that the proposed methodology gets a considerable performance boost when the number of cores is increased. It is possible to synthesize the model targeting either dedicated hardware or software on a multiprocessor once the original tool chain of the new platform is modified. For example, the proposed model can be translated into a SystemC model to facilitate a system-level design methodology.
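The abstract does not say which allocation algorithm is used. As one common way to spread profiled tasks evenly over cores, the sketch below uses a greedy longest-processing-time heuristic: tasks are taken in order of decreasing cost and each goes to the currently least-loaded core. The function name `allocate`, the task names and the cost figures are hypothetical.

```python
import heapq


def allocate(task_costs, n_cores):
    """Greedy balanced allocation (an assumed heuristic, not the thesis's).

    task_costs: dict mapping task name -> profiled cost.
    Returns (assignment, loads): tasks per core and total cost per core.
    """
    heap = [(0, core) for core in range(n_cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(n_cores)}
    # Place the most expensive tasks first.
    for task, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, core = heapq.heappop(heap)           # least-loaded core so far
        assignment[core].append(task)
        heapq.heappush(heap, (load + cost, core))
    loads = [0] * n_cores
    for load, core in heap:
        loads[core] = load
    return assignment, loads
```

For example, `allocate({"t0": 5, "t1": 4, "t2": 3, "t3": 3, "t4": 2, "t5": 1}, 2)` gives both cores a total load of 9.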

18

Girard, Natalie. "CellPilot: An extension of the Pilot library for Cell Broadband Engine processors and heterogeneous clusters." Thesis, 2012. http://hdl.handle.net/10214/3279.

Abstract:

The CellPilot library provides a uniform communication programming model, based on Pilot's process/channel approach, for clusters of Cell Broadband Engine processors. Pilot, a thin layer on top of the Message Passing Interface (MPI) library, allows processes to read/write messages on channels defined between pairs of processes on the cluster, but Pilot alone does not help a Cell programmer cope with the considerable complexities of intra-Cell communication. With CellPilot, programmers still design software in terms of processes, but they can now be located on a Cell node's Power Processor Elements (PPEs), on its Synergistic Processing Elements (SPEs), or on a non-Cell node within a heterogeneous Cell cluster, and communication is accomplished via channels between process pairs. Programs are coded in terms of reading and writing on those channels, whereupon CellPilot transparently applies whichever communication mechanisms are required to transport the message, regardless of its endpoints. This gives the programmer a way to handle inter-process communication while avoiding low-level I/O operations and the use of multiple libraries.
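The process/channel style described in this abstract can be pictured with a small sketch. The Python below uses threads and queues in place of cluster processes and MPI; the `Channel` class and the `producer`, `consumer` and `run` functions are hypothetical and are not the Pilot or CellPilot API.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way channel between a pair of processes."""

    def __init__(self):
        self._q = queue.Queue()

    def write(self, msg):
        self._q.put(msg)

    def read(self):
        return self._q.get()   # blocks until a message arrives


def producer(out, values):
    for v in values:
        out.write(v)
    out.write(None)            # sentinel: no more data


def consumer(inp, result):
    total = 0
    while True:
        v = inp.read()
        if v is None:
            break
        total += v
    result.write(total)


def run(values):
    """Wire two processes together with channels and return the sum."""
    data, result = Channel(), Channel()
    workers = [
        threading.Thread(target=producer, args=(data, values)),
        threading.Thread(target=consumer, args=(data, result)),
    ]
    for w in workers:
        w.start()
    total = result.read()
    for w in workers:
        w.join()
    return total
```

The code in `producer` and `consumer` only reads and writes channels; where each endpoint runs and how the bytes move is decided elsewhere, which is the property the abstract emphasizes.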

19

Xu, Meilian. "Exploiting parallelism of irregular problems and performance evaluation on heterogeneous multi-core architectures." 2012. http://hdl.handle.net/1993/9236.

Abstract:

In this thesis, we design, develop and implement parallel algorithms for irregular problems on heterogeneous multi-core architectures. Irregular problems exhibit random and unpredictable memory access patterns, poor spatial locality and input-dependent control flow. Heterogeneous multi-core processors vary in clock frequency, power dissipation, programming model (MIMD vs. SIMD), memory design and computing units (scalar versus vector units). The heterogeneity of the processors makes designing efficient parallel algorithms for irregular problems on heterogeneous multi-core processors challenging. Techniques for mapping tasks or data on traditional parallel computers cannot be used as is on heterogeneous multi-core processors due to the varying hardware. In an attempt to understand the efficiency of futuristic heterogeneous multi-core architectures on applications, we study several computation- and bandwidth-oriented irregular problems on one heterogeneous multi-core architecture, the IBM Cell Broadband Engine (Cell BE). The Cell BE consists of a general processor and eight specialized processors and addresses vector/data-level parallelism and instruction-level parallelism simultaneously. Through these studies on the Cell BE, we provide some discussion of and insight into the performance of the applications on heterogeneous multi-core architectures. Verifying these experimental results requires some performance modeling. Due to the diversity of heterogeneous multi-core architectures, theoretical performance models used for homogeneous multi-core architectures do not provide accurate results. Therefore, in this thesis we propose an analytical performance prediction model that considers the multitude of architectural features of heterogeneous multi-cores (such as DMA transfers, number of instructions and operations, the processor frequency and DMA bandwidth).
We show that the execution time from our prediction model is comparable to the execution time of the experimental results for a complex medical imaging application.
