
Dissertations / Theses on the topic 'Performance Optimization in Software and Hardware'


Consult the top 50 dissertations / theses for your research on the topic 'Performance Optimization in Software and Hardware.'


1

Schöne, Robert, Thomas Ilsche, Mario Bielert, Daniel Molka, and Daniel Hackenberg. "Software Controlled Clock Modulation for Energy Efficiency Optimization on Intel Processors." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-224966.

Abstract:
Current Intel processors implement a variety of power saving features like frequency scaling and idle states. These mechanisms limit the power draw and thereby decrease the thermal dissipation of the processors. However, they also have an impact on the achievable performance. The various mechanisms significantly differ regarding the amount of power savings, the latency of mode changes, and the associated overhead. In this paper, we describe and closely examine the so-called software controlled clock modulation mechanism for different processor generations. We present results that imply that the available documentation is not always correct and describe when this feature can be used to improve energy efficiency. We additionally compare it against the more popular feature of dynamic voltage and frequency scaling and develop a model to decide which feature should be used to optimize inter-process synchronizations on Intel Haswell-EP processors.
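The software-controlled clock modulation examined in this work is driven through the IA32_CLOCK_MODULATION model-specific register. As a hedged illustration (the bit layout below follows Intel's public documentation, not the thesis, and should be verified against the target processor generation), the duty-cycle encoding can be sketched as:

```python
MSR_CLOCK_MODULATION = 0x19A  # IA32_CLOCK_MODULATION, per Intel's public docs

def clock_modulation_value(duty_level: int) -> int:
    """Encode an on-demand clock-modulation duty level into an MSR value.

    Assumes the common layout: bit 4 enables modulation and bits 3:1 hold
    a duty level of 1-7, each step throttling to ~12.5% increments of the
    nominal clock. Some generations additionally use bit 0 for finer
    6.25% steps; this sketch ignores that."""
    if not 1 <= duty_level <= 7:
        raise ValueError("duty level must be in 1..7")
    return (1 << 4) | (duty_level << 1)
```

The resulting value would typically be written with a utility such as msr-tools (`wrmsr 0x19a <value>`), which requires privileged access.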
2

Vujic, Nikola. "Software caching techniques and hardware optimizations for on-chip local memories." Doctoral thesis, Universitat Politècnica de Catalunya, 2012. http://hdl.handle.net/10803/83598.

Abstract:
Although caches remain the most common choice for L1 memories in processors, on-chip local memories have received considerable attention lately. Local memories are an interesting design option due to their many benefits: smaller area occupancy, reduced energy consumption, and fast, constant access time. These benefits are especially relevant to the design of modern multicore processors, since power and latency are critical concerns in computer architecture today. Local memories also generate no coherency traffic, which matters for the scalability of multicore systems. Unfortunately, local memories have not yet been widely adopted in modern processors, mainly due to their poor programmability. Systems with on-chip local memories lack hardware support for transparent data transfers between local and global memories, so ease of programming is one of the main impediments to their broad acceptance. This thesis addresses software and hardware optimizations concerning the programmability and use of on-chip local memories in both single-core and multicore systems. The software optimizations relate to software caching techniques. A software cache is a robust approach to giving the user a transparent view of the memory architecture, but it can suffer from poor performance. In this thesis, we start by optimizing the traditional software cache, proposing a hierarchical, hybrid software-cache architecture. We then develop a few optimizations to speed up our hybrid software cache as much as possible. As a result of these software optimizations, our hybrid software cache performs 4 to 10 times faster than a traditional software cache on a set of NAS parallel benchmarks. We do not stop at software caching.
We also cover other aspects of architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of buffer management in local memories, in order to improve the performance of these architectures. Having pushed the software approach to its limits, we propose optimizations at the hardware level. Two hardware proposals are presented in this thesis: one relaxes the alignment constraints imposed by architectures with on-chip local memories, and the other accelerates the management of local memories by providing hardware support for the majority of the actions performed in our software cache.
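The software-caching idea this thesis builds on, emulating a cache in software on top of an explicitly managed local memory, can be sketched minimally as follows (a hypothetical direct-mapped design for illustration, not the hybrid architecture proposed in the thesis):

```python
class SoftwareCache:
    """Minimal direct-mapped software cache over a 'global' backing store.

    Illustrative sketch: tags and line buffers live in fast local memory;
    a miss copies a whole line from global memory (standing in for the
    DMA transfer a real local-memory system would issue)."""

    def __init__(self, global_mem, num_lines=4, line_size=8):
        self.global_mem = global_mem
        self.line_size = line_size
        self.tags = [None] * num_lines   # which global line each slot holds
        self.lines = [None] * num_lines  # the cached line buffers
        self.hits = self.misses = 0

    def read(self, addr):
        line_no = addr // self.line_size
        idx = line_no % len(self.tags)   # direct-mapped placement
        if self.tags[idx] != line_no:    # miss: fetch the line
            self.misses += 1
            base = line_no * self.line_size
            self.lines[idx] = self.global_mem[base:base + self.line_size]
            self.tags[idx] = line_no
        else:
            self.hits += 1
        return self.lines[idx][addr % self.line_size]
```

The tag check and miss handling performed here in software are exactly the actions the thesis's second hardware proposal accelerates with dedicated support.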
3

Serpa, Matheus da Silva. "Source code optimizations to reduce multi core and many core performance bottlenecks." Biblioteca Digital de Teses e Dissertações da UFRGS, 2018. http://hdl.handle.net/10183/183139.

Abstract:
Nowadays, several different architectures are available not only to industry but also to final consumers. Traditional multi-core processors, GPUs, accelerators such as the Xeon Phi, and even energy-efficiency-driven processors such as the ARM family present very different architectural characteristics. This wide range of characteristics presents a challenge for application developers, who must deal with different instruction sets, memory hierarchies, and even different programming paradigms when programming for these architectures. To optimize an application, it is important to have a deep understanding of how it behaves on different architectures. Related work offers a wide variety of solutions. Most of them focus on improving only memory performance. Others focus on load balancing, vectorization, and thread and data mapping, but perform them separately, losing optimization opportunities. In this master's thesis, we propose several optimization techniques to improve the performance of a real-world seismic exploration application provided by Petrobras, a multinational corporation in the petroleum industry. In our experiments, we show that loop interchange is a useful technique to improve the performance of different cache memory levels, improving performance by up to 5.3x and 3.9x on the Intel Broadwell and Intel Knights Landing architectures, respectively. By changing the code to enable vectorization, performance was increased by up to 1.4x and 6.5x. Load balancing improved performance by up to 1.1x on Knights Landing. Thread and data mapping techniques were also evaluated, with a performance improvement of up to 1.6x and 4.4x. We also compared the best version on each architecture and show that we improved the performance of Broadwell by 22.7x and of Knights Landing by 56.7x compared to a naive version; in the end, Broadwell was 1.2x faster than Knights Landing.
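Loop interchange, the first technique evaluated in this work, reorders nested loops so that the innermost loop walks memory with unit stride. A minimal sketch of the transformation (in Python for brevity; the cache effect only materializes in a compiled language such as C, but the reordering is identical):

```python
def sum_row_major(a):
    """Traverse a 2-D array row by row: for a row-major layout this gives
    unit-stride accesses that make good use of each cache line."""
    total = 0
    for i in range(len(a)):
        for j in range(len(a[0])):
            total += a[i][j]
    return total

def sum_column_major(a):
    """Same computation with the loops interchanged: column-by-column
    traversal strides across rows, so a row-major array touches a new
    cache line on almost every access for large sizes."""
    total = 0
    for j in range(len(a[0])):
        for i in range(len(a)):
            total += a[i][j]
    return total
```

Both orders compute the same result; only the memory access pattern, and hence cache behavior, differs, which is why interchange can pay off at several cache levels.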
4

Shee, Seng Lin (Computer Science & Engineering, Faculty of Engineering, UNSW). "ADAPT: architectural and design exploration for application specific instruction-set processor technologies." Awarded by: University of New South Wales, 2007. http://handle.unsw.edu.au/1959.4/35404.

Abstract:
This thesis presents design automation methodologies for extensible processor platforms in application-specific domains. The work first presents a single-processor approach to customization: a methodology that can rapidly create different processor configurations by removing unused instruction sets from the architecture. A profile-directed approach is used to identify frequently used instructions and to eliminate unused opcodes from the available instruction pool. A coprocessor approach is explored next to create an SoC (System-on-Chip) that speeds up the application while reducing energy consumption. Loops in applications are identified and accelerated by tightly coupling a coprocessor to an ASIP (Application Specific Instruction-set Processor). Latency hiding is used to exploit the parallelism provided by this architecture. A case study has been performed on a JPEG encoding algorithm, comparing two different coprocessor approaches: a high-level synthesis approach and our custom coprocessor approach. The thesis concludes by introducing a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. The problem of mapping each algorithmic stage in the system to an ASIP configuration is formulated. We propose an estimation technique to calculate runtimes of the configured multiprocessor system without running cycle-accurate simulations, which could take a significant amount of time. We present two heuristics to efficiently search the design space of a pipeline-based multi-ASIP system and compare the results against an exhaustive approach. In our first approach, we show that, on average, processor size can be reduced by 30% and energy consumption by 24%, while performance is improved by 24%.
In the coprocessor approach, compared with the use of a main processor alone, a loop performance improvement of 2.57x is achieved using the custom coprocessor approach, versus 1.58x for the high-level synthesis method and 1.33x for the customized instruction approach. Energy savings are 57%, 28%, and 19%, respectively. Our multiprocessor design provides a performance improvement of at least 4.03x for JPEG and 3.31x for MP3 over a single-processor design. The minimum cost obtained using our heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks, respectively.
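The profile-directed customization step described above, eliminating unused opcodes from the available instruction pool, can be sketched as follows (the trace and ISA representations are hypothetical illustrations, not the thesis toolflow):

```python
from collections import Counter

def prune_instruction_set(trace, full_isa, min_count=1):
    """Profile-directed ISA pruning sketch: count opcode occurrences in
    an execution trace and keep only the opcodes that are exercised at
    least min_count times; everything else can be removed from the
    customized processor configuration."""
    usage = Counter(trace)
    return {op for op in full_isa if usage[op] >= min_count}
```

A profiler would supply the trace; raising `min_count` would also drop rarely used opcodes, trading a little emulation overhead for a smaller processor.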
5

Sid, Lakhdar Riyane Yacine. "Méthodologie pour l'optimisation logicielle de structures de données pour les architectures hautes performances à mémoires complexes." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALM058.

Abstract:
With the rising impact of the memory wall, selecting the adequate data-structure implementation for a given kernel has become a performance-critical issue. The complexity of solving this Data-Layout-Decision (DLD) problem efficiently is dramatically increased by the concurrence of complex, heterogeneous, and application-specific hardware memories. Slightly modifying an optimized application or porting it to a new hardware architecture requires significant time and engineering effort, as well as a deep knowledge of the host hardware platform. In this thesis, we take a first step toward automatic software adaptation to hardware. We present an iterative, data-mining-based software-optimization approach built on the detection and exploration of the most influential parameters linked to the hardware, the operating system, and the software. We also propose a custom data-cache-miss modeling algorithm designed to serve as a fully parameterized performance evaluation. The proposed approach is designed to be embedded within a general-purpose compiler. To explore the parameters related to the data-layout implementation, we propose HARDSI, a patented method to solve the DLD problem, applied through a custom domain-specific language and computation framework. The HARDSI method chooses, from a custom knowledge base, a data-layout implementation optimized with regard to the memory pattern followed when accessing the considered data structure. The generated solutions are also specifically adapted to the properties of the host hardware memory. In addition, we consider the resolution of the DLD problem on memories that are explicitly addressed by the programmer (such as embedded scratchpad memories or GPUs). The problem we address is finding an optimized memory placement that maximizes the amount of frequently accessed data stored within this fast yet narrow memory.
In this context, we propose DDLGS, a patented method designed to generate a dynamic data layout with regard to the observed memory-access pattern. The generated implementations encompass the specific load and store routines as well as the granularity attributed to each datum transferred; they are also able to adapt, at run time, to the input of the considered source code. To evaluate our implementations in different hardware environments, we considered two different processor and memory architectures: (i) an x86 processor implementing an Intel Xeon with three levels of data caches using the least-recently-used replacement policy, and (ii) a Massively Parallel Processor Array implementing a Kalray Coolidge-80-30 with a 16-KByte on-chip scratchpad memory. Experiments on linear algebra, artificial intelligence, and image processing benchmarks show that our method accurately determines an optimized data-structure implementation. These implementations reach execution-time speed-ups of up to 48.9x on the Xeon processor and 54.2x on the Coolidge processor.
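A concrete instance of the data-layout decisions a DLD solver chooses among is the choice between array-of-structs and struct-of-arrays layouts. A hedged sketch of that transformation (the dict-based representation and field names are illustrative, not taken from the thesis):

```python
def aos_to_soa(records):
    """Convert an array-of-structs (list of dicts with identical keys)
    into a struct-of-arrays (dict of per-field lists). SoA gives
    unit-stride access when a kernel touches one field across many
    records, which is the kind of access pattern a DLD method matches
    layouts against."""
    if not records:
        return {}
    return {field: [r[field] for r in records] for field in records[0]}
```

Which layout wins depends on the access geometry: AoS suits whole-record accesses, SoA suits field-wise sweeps, and that dependence is precisely why the choice is worth automating.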
6

Pinto, Christian <1986&gt. "Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amsdottorato.unibo.it/6824/.

Abstract:
During the last few decades, unprecedented technological growth has been at the center of embedded systems design, with Moore's Law being the leading factor of this trend. Today an ever-increasing number of cores can be integrated on the same die, marking the transition from state-of-the-art multi-core chips to the new many-core design paradigm. Despite the extraordinarily high computing power, the complexity of many-core chips opens the door to several challenges. As a result of the increased silicon density of modern Systems-on-a-Chip (SoC), the design space that must be explored to find the best design has exploded, and hardware designers face a huge design space. Virtual Platforms have long been used to enable hardware-software co-design, but today they face the huge complexity of both hardware and software systems. In this thesis two different research works on Virtual Platforms are presented: the first is intended for the hardware developer, to easily allow complex cycle-accurate simulations of many-core SoCs. The second exploits the parallel computing power of off-the-shelf General Purpose Graphics Processing Units (GPGPUs), with the goal of increased simulation speed. The term virtualization can be used in the context of many-core systems not only to refer to the aforementioned hardware emulation tools (Virtual Platforms), but also for two other main purposes: 1) to help the programmer achieve the maximum possible performance of an application by hiding the complexity of the underlying hardware, and 2) to efficiently exploit the highly parallel hardware of many-core chips in environments with multiple active Virtual Machines. This thesis focuses on virtualization techniques that mitigate, and where possible overcome, some of the challenges introduced by the many-core design paradigm.
7

Muffang, Louis. "SLAM Hardware & Software optimization for mobile platform integration." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-294332.

Abstract:
This thesis focuses on the optimization of a state-of-the-art monocular Visual-Inertial Odometry (VIO) algorithm for real-time application with limited resources on an embedded system. We use a multi-processor unit equipped with a Digital Signal Processor (DSP) to accelerate and offload tasks from the CPU. The goal is to reduce resource consumption without degrading the algorithm's speed or accuracy. To this end, we first identify OpenVINS [1] as a suitable algorithm for this work and find the functions to optimize. Comparing the DSP-optimized version of the algorithm with the original, we achieve similar accuracy with more than 1.5x savings in CPU power consumption and more than 2x memory savings. This work is relevant to any embedded system that requires a vision-based localization system running alongside other CPU-heavy tasks.
8

Motiwala, Quaeed. "Optimizations for acyclic dataflow graphs for hardware-software codesign." Thesis, This resource online, 1994. http://scholar.lib.vt.edu/theses/available/etd-06302009-040504/.

9

Shen, Chung-Ching. "Energy-driven optimization of hardware and software for distributed embedded systems." College Park, Md.: University of Maryland, 2008. http://hdl.handle.net/1903/8901.

Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2008.
Thesis research directed by: Dept. of Electrical and Computer Engineering. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
10

Brankovic, Aleksandar. "Performance simulation methodologies for hardware/software co-designed processors." Doctoral thesis, Universitat Politècnica de Catalunya, 2015. http://hdl.handle.net/10803/287978.

Abstract:
Recently the community started looking into Hardware/Software (HW/SW) co-designed processors as potential solutions for less power-consuming and less complex designs. Unlike other solutions, they reduce power and complexity by performing so-called dynamic binary translation and optimization from a guest ISA to an internal custom host ISA. This thesis tries to answer the question of how to simulate this kind of architecture. For any processor architecture, simulation is common practice, because it is impossible to build several versions of the hardware in order to try all alternatives. The simulation of HW/SW co-designed processors poses a major challenge compared with the simulation of traditional HW-only architectures. First of all, open-source tools do not exist, so researchers often assume that the overhead of the software layer, which is in charge of dynamic binary translation and optimization, is constant, or they ignore it. In this thesis we show that such an assumption is not valid and can lead to very inaccurate results; including the software layer in the simulation is therefore a must. On the other hand, simulation is very slow compared with native execution, so the community has spent great effort on delivering accurate results in a reasonable amount of time. It is therefore common practice for HW-only processors to simulate only parts of the application stream, called samples. Samples usually correspond to different phases in the application stream and are usually no longer than a few million instructions. In order to achieve an accurate starting state for each sample, microarchitectural structures are warmed up for a few million instructions prior to the sample's instructions. Unfortunately, such a methodology cannot be directly applied to HW/SW co-designed processors.
The warm-up for HW/SW co-designed processors needs to be 3-4 orders of magnitude longer than the warm-up needed for a traditional HW-only processor, because the warm-up of the software layer needs to be longer than the warm-up of the hardware structures. To overcome this problem, in this thesis we propose a novel warm-up technique specialized for HW/SW co-designed processors. Our solution reduces the simulation time by at least 65x with an average error of just 0.75%. This trend holds for different software and hardware configurations. The process used to determine simulation samples cannot be applied to HW/SW co-designed processors either, because, due to the software layer, samples show more dissimilarities than in the case of HW-only processors. We therefore propose a novel algorithm that needs 3x fewer samples to achieve an error similar to that of state-of-the-art algorithms. Again, this trend holds for different software and hardware configurations.
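The sampling-plus-warm-up methodology discussed here can be illustrated in miniature: measure only short samples of a long stream, discarding a warm-up prefix of each sample so that measurement starts from a warmed-up state (a toy sketch; real samplers warm up microarchitectural state and, for co-designed processors, software-layer state, rather than skipping elements of a metric stream):

```python
def sampled_mean(stream, sample_starts, warmup_len, sample_len):
    """Estimate the mean of a long per-instruction metric stream (e.g.
    CPI) from a few samples. The first warmup_len elements of each
    sample are executed but not measured, mimicking the warm-up phase
    that precedes measurement in sampled simulation."""
    measured = []
    for start in sample_starts:
        lo = start + warmup_len
        measured.extend(stream[lo:lo + sample_len])
    return sum(measured) / len(measured)
```

Picking sample start points that cover the stream's distinct phases is exactly the sample-selection problem the thesis's second contribution addresses.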
APA, Harvard, Vancouver, ISO, and other styles
11

Oh, Jungju. "Efficient hardware and software assist for many-core performance." Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50219.

Full text
Abstract:
In recent years, the number of available cores in a processor has been increasing rapidly while the pace of performance improvement of an individual core has lagged. This has led application developers to extract more parallelism across many cores to make their applications run faster. However, writing a parallel program that scales well with increasing core counts is challenging. Consequently, many parallel applications suffer from performance bugs caused by scalability limiters. We expect core counts to continue to increase for the foreseeable future; hence, addressing scalability limiters is important for better performance on future hardware. With this thesis, I propose both software frameworks and hardware improvements that I developed to address three important scalability limiters: load imbalance, barrier latency, and increasing on-chip packet latency. First, I introduce a debugging framework for load imbalance called LIME. The LIME framework uses profiling, statistical analysis, and control flow graph analysis to automatically determine the nature of load imbalance problems and pinpoint the code where the problems are introduced. Second, I address the scalability problem of the barrier, which has become costly and for which scalable performance is difficult to achieve. To address this problem, I propose transmission line (TL) based hardware barrier support, called TLSync, that is orders of magnitude faster than software barrier implementations while supporting many (tens of) barriers simultaneously using a single chip-spanning network. Third and lastly, I focus on the increasing packet latency in on-chip networks, and propose a hybrid interconnect in which a low-latency TL-based interconnect is used synergistically with a high-throughput switched interconnect. Also, a new adaptive packet-steering policy is created to judiciously use the limited throughput available on the low-latency TL interconnect.
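As context for the barrier-latency discussion, the cost of software barriers can be illustrated with a minimal sketch: in a centralized barrier, every arriving thread serializes on one shared counter, and this contention grows with the thread count, which is the scalability limiter that hardware schemes such as TLSync target. The sketch below is illustrative Python, not code from the thesis.

```python
import threading

class CounterBarrier:
    """Centralized sense-reversing barrier. Every arrival updates one
    shared counter under a lock, so contention grows with the number
    of threads; this is why software barriers scale poorly."""
    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.sense = False
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_sense = not self.sense
            self.count += 1
            if self.count == self.parties:
                # Last arrival resets the counter and releases everyone.
                self.count = 0
                self.sense = my_sense
                self.cond.notify_all()
            else:
                while self.sense != my_sense:
                    self.cond.wait()

results = []
barrier = CounterBarrier(4)

def worker(i):
    results.append(("before", i))
    barrier.wait()  # no thread proceeds until all 4 have arrived
    results.append(("after", i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the join, every "before" record precedes every "after" record, which is exactly the ordering guarantee a barrier provides.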
APA, Harvard, Vancouver, ISO, and other styles
12

Bekker, Dmitriy L. "Hardware and software optimization of Fourier transform infrared spectrometry on hybrid-FPGAs /." Online version of thesis, 2007. http://hdl.handle.net/1850/4805.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Figueiredo, Boneti Carlos Santieri de. "Exploring coordinated software and hardware support for hardware resource allocation." Doctoral thesis, Universitat Politècnica de Catalunya, 2009. http://hdl.handle.net/10803/6018.

Full text
Abstract:
Multithreaded processors are now common in the industry as they offer high performance at a low cost. Traditionally, in such processors, the assignment of hardware resources between the multiple threads is done implicitly, by the hardware policies. However, a new class of multithreaded hardware allows the explicit allocation of resources to be controlled or biased by the software. Currently, there is little or no coordination between the allocation of resources done by the hardware and the prioritization of tasks done by the software.
This thesis aims to narrow the gap between the software and the hardware, with respect to hardware resource allocation, by proposing a new explicit resource allocation hardware mechanism and novel schedulers that use the currently available hardware resource allocation mechanisms.
It approaches the problem in two different types of computing systems. In the high performance computing domain, we characterize the first processor to present a mechanism that allows the software to bias the allocation of hardware resources, the IBM POWER5. In addition, we propose the use of hardware resource allocation as a way to balance high performance computing applications. Finally, we propose two new scheduling mechanisms that are able to transparently and successfully balance applications in real systems using hardware resource allocation. In the soft real-time domain, we propose a hardware extension to the existing explicit resource allocation hardware and, in addition, two software schedulers that use the explicit allocation hardware to improve the schedulability of tasks in a soft real-time system.
In this thesis, we demonstrate that system performance improves by making the software aware of the mechanisms that control the amount of resources given to each running thread. In particular, for the high performance computing domain, we show that it is possible to decrease the execution time of MPI applications by biasing the hardware resource assignment between threads. In addition, we show that it is possible to decrease the number of missed deadlines when scheduling tasks in a soft real-time SMT system.
APA, Harvard, Vancouver, ISO, and other styles
14

Ramírez, Bellido Alejandro. "High performance instruction fetch using software and hardware co-design." Doctoral thesis, Universitat Politècnica de Catalunya, 2002. http://hdl.handle.net/10803/5969.

Full text
Abstract:
In recent years, the design of high-performance processors has progressed along two research avenues: increasing pipeline depth to allow higher clock frequencies, and widening the pipeline to allow the parallel execution of a larger number of instructions. Designing a high-performance processor involves balancing all of its components to ensure that overall performance is not limited by any individual component. This means that if we give the processor a faster execution unit, we must make sure we can fetch and decode instructions fast enough to keep that execution unit busy.

This thesis explores the challenges posed by the design of the fetch unit from two points of view: the design of software better suited to existing fetch architectures, and the design of hardware adapted to the special characteristics of the new software we have generated.

Our approach to designing new software is a new code-reordering algorithm that aims not only to improve instruction cache performance, but also to increase the effective width of the fetch unit. Using information about program behavior (profile data), we chain the program's basic blocks so that conditional branches tend to be not taken, which favors sequential execution of the code. Once the basic blocks are organized into these traces, we map the traces into memory so as to minimize both the space required by the truly useful code and its memory conflicts. Besides describing the algorithm, we provide a detailed analysis of the impact of these optimizations on the different aspects of fetch-unit performance: memory latency, effective fetch width, and branch-predictor accuracy.

Based on our analysis of the behavior of the optimized codes, we also propose a modification of the trace cache mechanism that aims to make more effective use of the scarce available storage. This mechanism uses the trace cache only to store those traces that could not be provided by the instruction cache in a single cycle.

Also building on what we learned about the behavior of the optimized codes, we propose a new branch predictor that makes extensive use of the same information used to reorder the code, in this case to improve branch prediction accuracy.

Finally, we propose a new architecture for the processor's fetch unit based on exploiting the special characteristics of the optimized codes. Our architecture has a very low level of complexity, similar to that of an architecture capable of fetching a single basic block per cycle, yet it offers much higher performance, comparable to that of a trace cache, which is far more costly and complex.
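The profile-guided chaining of basic blocks described above can be illustrated with a small greedy sketch. The function, the edge-count dictionary, and the block names below are hypothetical; the sketch only captures the general idea of laying out each block's hottest successor as the fall-through path, so that conditional branches tend to be not taken.

```python
def build_traces(edges, entry_blocks):
    """Greedy trace formation from profile data (an illustrative
    sketch of profile-guided code layout, not the thesis's exact
    algorithm). Starting from each entry block, keep appending the
    most frequently executed not-yet-placed successor, so hot
    branches become fall-throughs."""
    placed = set()
    traces = []
    for entry in entry_blocks:
        if entry in placed:
            continue
        trace, block = [], entry
        while block is not None and block not in placed:
            trace.append(block)
            placed.add(block)
            succs = edges.get(block, {})
            # Choose the hottest unplaced successor as the fall-through.
            candidates = {b: w for b, w in succs.items() if b not in placed}
            block = max(candidates, key=candidates.get) if candidates else None
        traces.append(trace)
    return traces

# Hypothetical profile: block -> {successor: execution count}.
profile = {
    "A": {"B": 90, "C": 10},
    "B": {"D": 80, "C": 20},
    "C": {"D": 30},
    "D": {},
}
traces = build_traces(profile, ["A", "C"])
```

With this profile, the hot path A, B, D is laid out as one sequential trace, and the cold block C forms its own trace.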
APA, Harvard, Vancouver, ISO, and other styles
15

Schultz, Eric A. "Empirical Performance Comparison of Hardware and Software Task Context Switching." Thesis, Monterey, California. Naval Postgraduate School, 2009. http://hdl.handle.net/10945/46226.

Full text
Abstract:
Approved for public release; distribution unlimited.
There are many divergent opinions regarding possible differences between the performance of hardware and software context switching implementations, but there are no concrete empirical measures of their true differences. Using an empirical testing methodology, this research performed seven experiments, collecting quantitative performance results on hardware- and software-based context switch implementations with two and four hardware privilege levels. The implementations measured are the hardware-based Intel IA-32 context switch, the software-based MINIX 3 context switch, a software-based simulation of a MINIX 3 context switch with four hardware privilege levels, and a software-based simulation of an Intel IA-32 hardware context switch. Experiments were executed using the Trusted Computing Exemplar Least Privilege Separation Kernel and the Linux 2.6 kernel. The results include the number of cycles and the time required to complete the processing of each implementation. This study concludes that the hardware-based context switching mechanism is significantly slower than the software implementations, even those that simulate the elaborate checks of the hardware implementation. A possible reason for this is posited.
APA, Harvard, Vancouver, ISO, and other styles
16

Chaudhuri, Matthew Alan. "Optimization of a hardware/software coprocessing platform for EEG eyeblink detection and removal /." Online version of thesis, 2008. http://hdl.handle.net/1850/8967.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Powell, Richard, and Jeff Kuhn. "HARDWARE- VS. SOFTWARE-DRIVEN REAL-TIME DATA ACQUISITION." International Foundation for Telemetering, 2000. http://hdl.handle.net/10150/608291.

Full text
Abstract:
International Telemetering Conference Proceedings / October 23-26, 2000 / Town & Country Hotel and Conference Center, San Diego, California
There are two basic approaches to developing data acquisition systems. The first is to buy or develop acquisition hardware and to then write software to input, identify, and distribute the data for processing, display, storage, and output to a network. The second is to design a system that handles some or all of these tasks in hardware instead of software. This paper describes the differences between software-driven and hardware-driven system architectures as applied to real-time data acquisition systems. In explaining the characteristics of a hardware-driven system, a high-performance real-time bus system architecture developed by L-3 will be used as an example. This architecture removes the bottlenecks and unpredictability that can plague software-driven systems when applied to complex real-time data acquisition applications. It does this by handling the input, identification, routing, and distribution of acquired data without software intervention.
APA, Harvard, Vancouver, ISO, and other styles
18

Sredojević, Ranko Radovin. "Template-based hardware-software codesign for high-performance embedded numerical accelerators." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/84895.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 129-132).
Sophisticated algorithms for control, state estimation and equalization have tremendous potential to improve performance and create new capabilities in embedded and mobile systems. Traditional implementation approaches are not well suited for porting these algorithmic solutions into practical implementations within embedded system constraints. Most of the technical challenges arise from design approach that manipulates only one level in the design stack, thus being forced to conform to constraints imposed by other levels without question. In tightly constrained environments, like embedded and mobile systems, such approaches have a hard time efficiently delivering and delivering efficiency. In this work we offer a solution that cuts through all the design stack layers. We build flexible structures at the hardware, software and algorithm level, and approach the solution through design space exploration. To do this efficiently we use a template-based hardware-software development flow. The main incentive for template use is, as in software development, to relax the generality vs. efficiency/performance type tradeoffs that appear in solutions striving to achieve run-time flexibility. As a form of static polymorphism, templates typically incur very little performance overhead once the design is instantiated, thus offering the possibility to defer many design decisions until later stages when more is known about the overall system design. However, simply including templates into design flow is not sufficient to result in benefits greater than some level of code reuse. In our work we propose using templates as flexible interfaces between various levels in the design stack. As such, template parameters become the common language that designers at different levels of design hierarchy can use to succinctly express their assumptions and ideas. Thus, it is of great benefit if template parameters map directly and intuitively into models at every level. 
To showcase the approach we implement a numerical accelerator for embedded Model Predictive Control (MPC) algorithm. While most of this work and design flow are quite general, their full power is realized in search for good solutions to a specific problem. This is best understood in direct comparison with recent works on embedded and high-speed MPC implementations. The controllers we generate outperform published works by a handsome margin in both speed and power consumption, while taking very little time to generate.
by Ranko Radovin Sredojević.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
19

Marques, Vítor Manuel dos Santos. "Performance of hardware and software sorting algorithms implemented in a SOC." Master's thesis, Universidade de Aveiro, 2017. http://hdl.handle.net/10773/23467.

Full text
Abstract:
Master's in Computer and Telematics Engineering
Field Programmable Gate Arrays (FPGAs) were invented by Xilinx in 1985. Their reconfigurable nature allows them to be used in multiple areas of information technology. This project aims to study this technology as an alternative to traditional data processing methods, namely sorting. The proposed solution is based on the principle of reusing resources to counter this technology's known resource limitations.
APA, Harvard, Vancouver, ISO, and other styles
20

O'Connor, R. Brendan. "Dataflow Analysis and Optimization of High Level Language Code for Hardware-Software Co-Design." Thesis, Virginia Tech, 1996. http://hdl.handle.net/10919/36653.

Full text
Abstract:
Recent advancements in FPGA technology have provided devices which are not only suited for digital logic prototyping, but also are capable of implementing complex computations. The use of these devices in multi-FPGA Custom Computing Machines (CCMs) has provided the potential to execute large sections of programs entirely in custom hardware which can provide a substantial speedup over execution in a general-purpose sequential processor. Unfortunately, the development tools currently available for CCMs do not allow users to easily configure multi-FPGA platforms. In order to exploit the capabilities of such an architecture, a procedure has been developed to perform a dataflow analysis of programs written in C which is capable of several hardware-specific optimizations. This, together with other software tools developed for this purpose, allows CCMs and their host processors to be targeted from the same high-level specification.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
21

Igual, Pérez Román José. "Platform Hardware/Software for the energy optimization in a node of wireless sensor networks." Thesis, Lille 1, 2016. http://www.theses.fr/2016LIL10041/document.

Full text
Abstract:
The remarkable increase of connected objects in the Internet of Things world will raise several problems, energy efficiency being one of the main ones. The present work studies energy efficiency and, more precisely, the modelling of the energy consumed by the node. We have created a hardware/software platform called Synergie, composed of a set of hardware/software tools: - a device measuring energy consumption; - an algorithm that automatically builds a model of the energy consumption; - an estimator of the node lifetime. The energy measurement platform recovers current values directly from the node. These currents are measured component by component of the circuit and function by function of the embedded software; this hardware/software analysis provides information about the behaviour of each component. An algorithm automatically builds an energy consumption model based on a Markov chain. This model is a stochastic representation of the energy behaviour of a node operating in situ, running in a real network and under real channel conditions. Finally, the node lifetime is estimated using battery models; the estimation is made possible by the stochastic nature of the consumption model. We also show how easily the consumption parameters can be changed to improve the lifetime. This work is the first step of a global project aiming at energy-autonomous wireless sensor networks.
The significant increase of connected objects in the Internet of Things will entail different problems, among them energy efficiency. The present work deals with energy efficiency and, more precisely, with the modelling of the energy consumption of the node. We have designed a platform to instrument a node of a wireless sensor network in its real environment. The hardware and software platform is made of: - a hardware energy measurement platform; - software allowing the automatic generation of an energy consumption model; - a node lifetime estimation algorithm. The energy measurement platform recovers the current values directly from the node under evaluation, component by component in the electronic circuit and function by function of the embedded software. This hardware/software analysis of the energy consumption offers important information about the behaviour of each electronic component in the node. An algorithm carries out a statistical analysis of the energy measurements and automatically creates an energy consumption model based on a Markov chain. The platform thus builds a stochastic model of the energy behaviour of a real node, in a real network and under real channel conditions, in contrast to the deterministic energy models found in the literature, whose energy behaviour is extracted from component datasheets. Finally, we estimate the node lifetime based on battery models, and we show through examples how easily some parameters of the model can be changed in order to improve the energy efficiency.
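The Markov-chain energy model described in this abstract can be illustrated with a small sketch: given a set of node states with measured power draws and the transition probabilities between them, the chain's stationary distribution yields an estimate of the node's average power. The states, power values, and transition matrix below are hypothetical, not the thesis's measurements.

```python
def stationary(P, iters=200):
    """Stationary distribution of a Markov chain via power iteration.
    P is a row-stochastic transition matrix as nested lists."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical node states with average power draw in milliwatts,
# and hypothetical measured transition probabilities between them.
states = ["sleep", "process", "transmit"]
power_mw = [0.05, 9.0, 54.0]
P = [
    [0.90, 0.08, 0.02],  # from sleep
    [0.30, 0.60, 0.10],  # from process
    [0.50, 0.10, 0.40],  # from transmit
]
pi = stationary(P)
# Long-run average power is the power draw weighted by state occupancy.
avg_power_mw = sum(p * w for p, w in zip(pi, power_mw))
```

Feeding such an average power into a battery model then gives a lifetime estimate, which is the use the thesis makes of its stochastic model.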
APA, Harvard, Vancouver, ISO, and other styles
22

Suuronen, Janne. "Towards Defining Models of Hardware Capacity and Software Performance for Telecommunication Applications." Thesis, Mälardalens högskola, Inbyggda system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-48773.

Full text
Abstract:
Knowledge of the resource usage of applications and the resource usage capacity of hardware platforms is essential when developing a system. The resource usage must not exceed the capacity of a platform, as the system could otherwise fail to meet its real-time constraints due to resource shortages. Furthermore, it is beneficial from a cost-effectiveness standpoint that a hardware platform is not under-utilised by the system software. This thesis examines two system aspects: the hardware resource usage of applications and the resource capacity of hardware platforms, defined as the capacity of each resource included in a hardware platform. Both aspects are investigated and modelled using a black-box perspective, since the focus is on observing the online usage and capacity. Investigating and modelling these two aspects is a crucial step towards defining and constructing hardware and software models. We evaluate regression and autoregressive approaches to modelling the CPU, L2 cache, and L3 cache usage of applications. The conclusion is that first-order autoregressive models and Multivariate Adaptive Regression Splines show promise of being able to model resource usage. The primary limitation of both modelling approaches is their inability to model resource usage when it is highly irregular. The capacity models of CPU, L2, and L3 cache, derived by exerting heavy workloads on a test platform, hold up against a real-life application with respect to L2 and L3 cache capacity. However, the CPU usage model underestimates the test platform's capacity, since the real-life application exceeds the theoretical maximum usage defined by the model.
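A first-order autoregressive model of resource usage, one of the approaches this thesis evaluates, can be sketched with a plain least-squares fit. The `fit_ar1` function and the synthetic trace are illustrative assumptions, not the thesis's data or tooling.

```python
def fit_ar1(series):
    """Least-squares fit of a first-order autoregressive model
    x[t] = c + phi * x[t-1] to a usage time series (illustrative
    sketch of AR(1) modelling of, e.g., CPU or cache usage)."""
    xs = series[:-1]   # predictors: x[t-1]
    ys = series[1:]    # targets:    x[t]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    phi = cov / var            # autoregressive coefficient
    c = mean_y - phi * mean_x  # intercept
    return c, phi

# Synthetic, noise-free usage trace generated by x[t] = 1.0 + 0.8*x[t-1];
# the fit should recover those two parameters exactly.
trace = [10.0]
for _ in range(50):
    trace.append(1.0 + 0.8 * trace[-1])
c, phi = fit_ar1(trace)
```

On noise-free data the fit is exact; on a real, highly irregular trace the residuals grow, which matches the limitation the abstract reports.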
APA, Harvard, Vancouver, ISO, and other styles
23

Rosqvist, Åkerblom Linn. "JavaScript Performance and Optimization : Removing bottlenecks." Thesis, Mittuniversitetet, Avdelningen för data- och systemvetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-25474.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Patki, Tapasya. "The Case For Hardware Overprovisioned Supercomputers." Diss., The University of Arizona, 2015. http://hdl.handle.net/10150/577307.

Full text
Abstract:
Power management is one of the most critical challenges on the path to exascale supercomputing. High Performance Computing (HPC) centers today are designed to be worst-case power provisioned, leading to two main problems: limited application performance and under-utilization of procured power. In this dissertation we introduce hardware overprovisioning: a novel, flexible design methodology for future HPC systems that addresses the aforementioned problems and leads to significant improvements in application and system performance under a power constraint. We first establish that choosing the right configuration based on application characteristics when using hardware overprovisioning can improve application performance under a power constraint by up to 62%. We conduct a detailed analysis of the infrastructure costs associated with hardware overprovisioning and show that it is an economically viable supercomputing design approach. We then develop RMAP (Resource MAnager for Power), a power-aware, low-overhead, scalable resource manager for future hardware overprovisioned HPC systems. RMAP addresses the issue of under-utilized power by using power-aware backfilling and improves job turnaround times by up to 31%. This dissertation opens up several new avenues for research in power-constrained supercomputing as we venture toward exascale, and we conclude by enumerating these.
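The power-aware backfilling idea behind RMAP can be illustrated with a toy scheduling step: a queued job is started only if it fits in both the free nodes and the unused portion of the system power budget, so procured power does not sit idle. The function and the job mix below are hypothetical sketches, not RMAP's published algorithm.

```python
def backfill(queue, free_nodes, power_budget_w, running_power_w):
    """Illustrative power-aware backfilling pass: walk the queue in
    order and start every job that fits within both the remaining
    nodes and the remaining system power budget."""
    started = []
    for job in queue:
        fits_nodes = job["nodes"] <= free_nodes
        fits_power = running_power_w + job["power_w"] <= power_budget_w
        if fits_nodes and fits_power:
            started.append(job["name"])
            free_nodes -= job["nodes"]
            running_power_w += job["power_w"]
    return started

# Hypothetical job mix on a machine with 48 free nodes and a 50 kW
# budget, 40 kW of which is already committed to running jobs.
queue = [
    {"name": "big",   "nodes": 64, "power_w": 20000},  # too many nodes
    {"name": "mid",   "nodes": 32, "power_w": 8000},
    {"name": "small", "nodes": 8,  "power_w": 1500},
]
started = backfill(queue, free_nodes=48,
                   power_budget_w=50000, running_power_w=40000)
```

Here the large job cannot start, but the two smaller jobs backfill into the leftover nodes and power, improving both utilisation and turnaround time, which is the effect the dissertation quantifies.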
APA, Harvard, Vancouver, ISO, and other styles
25

Ewing, John M. "Autonomic Performance Optimization with Application to Self-Architecting Software Systems." Thesis, George Mason University, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3706982.

Full text
Abstract:

Service Oriented Architectures (SOA) are an emerging software engineering discipline that builds software systems and applications by connecting and integrating well-defined, distributed, reusable software service instances. SOA can speed development time and reduce costs by encouraging reuse, but this new service paradigm presents significant challenges. Many SOA applications are dependent upon service instances maintained by vendors and/or separate organizations. Applications and composed services using disparate providers typically demonstrate limited autonomy with contemporary SOA approaches. Availability may also suffer with the proliferation of possible points of failure—restoration of functionality often depends upon intervention by human administrators.

Autonomic computing is a set of technologies that enables self-management of computer systems. When applied to SOA systems, autonomic computing can provide automatic detection of faults and take restorative action. Additionally, autonomic computing techniques possess optimization capabilities that can leverage the features of SOA (e.g., loose coupling) to enable peak performance in the SOA system's operation. This dissertation demonstrates that autonomic computing techniques can help SOA systems maintain high levels of usefulness and usability.

This dissertation presents a centralized autonomic controller framework to manage SOA systems in dynamic service environments. The centralized autonomic controller framework can be enhanced through a second meta-optimization framework that automates the selection of optimization algorithms used in the autonomic controller. A third framework for autonomic meta-controllers can study, learn, adjust, and improve the optimization procedures of the autonomic controller at run-time. Within this framework, two different types of meta-controllers were developed. The Overall Best meta-controller tracks overall performance of different optimization procedures. Context Best meta-controllers attempt to determine the best optimization procedure for the current optimization problem. Three separate Context Best meta-controllers were implemented using different machine learning techniques: 1) K-Nearest Neighbor (KNN MC), 2) Support Vector Machines (SVM) trained offline (Offline SVM), and 3) SVM trained online (Online SVM).

A detailed set of experiments demonstrated the effectiveness and scalability of the approaches. Autonomic controllers of SOA systems successfully maintained performance on systems with 15, 25, 40, and 65 components. The Overall Best meta-controller successfully identified the best optimization technique and provided excellent performance at all levels of scale. Among the Context Best meta-controllers, the Online SVM meta-controller was tested on the 40 component system and performed better than the Overall Best meta-controller at a 95% confidence level. Evidence indicates that the Online SVM was successfully learning which optimization procedures were best applied to encountered optimization problems. The KNN MC and Offline SVM were less successful. The KNN MC struggled because the KNN algorithm does not account for the asymmetric cost of prediction errors. The Offline SVM was unable to predict the correct optimization procedure with sufficient accuracy—this was likely due to the challenge of building a relevant offline training set. The meta-optimization framework, which was tested on the 65 component system, successfully improved the optimization techniques used by the autonomic controller.

The meta-optimization and meta-controller frameworks described in this dissertation have broad applicability in autonomic computing and related fields. This dissertation also details a technique for measuring the overlap of two populations of points, establishes an approach for using penalty weights to address one-sided overfitting by SVM on asymmetric data sets, and develops a set of high performance data structure and heuristic search templates for C++.

APA, Harvard, Vancouver, ISO, and other styles
26

Dudebout, Nicolas. "Multigigabit multimedia processor for 60GHz WPAN a hardware software codesign implementation /." Thesis, Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26677.

Full text
Abstract:
Thesis (M. S.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Member: Chang, Gee-Kung; Committee Member: Hasler, Paul; Committee Member: Laskar, Joy. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
27

Subramanian, Sriram. "Software Performance Estimation Techniques in a Co-Design Environment." University of Cincinnati / OhioLINK, 2003. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1061553201.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Ganpaa, Gayatri. "An R*-Tree Based Semi-Dynamic Clustering Method for the Efficient Processing of Spatial Join in a Shared-Nothing Parallel Database System." ScholarWorks@UNO, 2006. http://scholarworks.uno.edu/td/298.

Full text
Abstract:
The growing importance of geospatial databases has made it essential to perform complex spatial queries efficiently. To achieve acceptable performance levels, database systems have been increasingly required to make use of parallelism. The spatial join is a computationally expensive operator. Efficient implementation of the join operator is, thus, desirable. The work presented in this document attempts to improve the performance of spatial join queries by distributing the data set across several nodes of a cluster and executing queries across these nodes in parallel. This document discusses a new parallel algorithm that implements the spatial join in an efficient manner. This algorithm is compared to an existing parallel spatial-join algorithm, the clone join. Both algorithms have been implemented on a Beowulf cluster and compared using real datasets. An extensive experimental analysis reveals that the proposed algorithm exhibits superior performance both in declustering time as well as in the execution time of the join query.
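The declustering idea behind parallel spatial joins can be illustrated with a partition-based sketch. The thesis uses an R*-tree based method on a shared-nothing cluster; the simple grid below is only a stand-in for that declustering step: rectangles are hashed into grid cells, and each cell's sub-join is independent, so it could run on a different node. All names and data are hypothetical.

```python
def grid_partition_join(rects_a, rects_b, cell=10.0):
    """Partition-based spatial join sketch. Each rectangle
    (id, xmin, ymin, xmax, ymax) is assigned to every grid cell it
    overlaps; the join is then computed per cell. Pairs found in
    several shared cells are deduplicated via the result set."""
    def cells(r):
        _, x0, y0, x1, y1 = r
        return {(cx, cy)
                for cx in range(int(x0 // cell), int(x1 // cell) + 1)
                for cy in range(int(y0 // cell), int(y1 // cell) + 1)}

    buckets = {}  # cell -> (rects from A, rects from B)
    for r in rects_a:
        for c in cells(r):
            buckets.setdefault(c, ([], []))[0].append(r)
    for r in rects_b:
        for c in cells(r):
            buckets.setdefault(c, ([], []))[1].append(r)

    def overlaps(a, b):
        return a[1] <= b[3] and b[1] <= a[3] and a[2] <= b[4] and b[2] <= a[4]

    return {(a[0], b[0])
            for side_a, side_b in buckets.values()
            for a in side_a for b in side_b if overlaps(a, b)}

A = [("a1", 0, 0, 4, 4), ("a2", 12, 12, 15, 15)]
B = [("b1", 3, 3, 6, 6), ("b2", 20, 20, 22, 22)]
pairs = grid_partition_join(A, B)
```

In a shared-nothing setting, each bucket would live on one node, so the per-cell joins run in parallel with no data exchange beyond the initial declustering.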
APA, Harvard, Vancouver, ISO, and other styles
29

Linford, John Christian. "Accelerating Atmospheric Modeling Through Emerging Multi-core Technologies." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/27599.

Full text
Abstract:
The new generations of multi-core chipset architectures achieve unprecedented levels of computational power while respecting physical and economical constraints. The cost of this power is bewildering program complexity. Atmospheric modeling is a grand-challenge problem that could make good use of these architectures if they were more accessible to the average programmer. To that end, software tools and programming methodologies that greatly simplify the acceleration of atmospheric modeling and simulation with emerging multi-core technologies are developed. A general model is developed to simulate atmospheric chemical transport and atmospheric chemical kinetics. The Cell Broadband Engine Architecture (CBEA), General Purpose Graphics Processing Units (GPGPUs), and homogeneous multi-core processors (e.g. Intel Quad-core Xeon) are introduced. These architectures are used in case studies of transport modeling and kinetics modeling and demonstrate per-kernel speedups as high as 40x. A general analysis and code generation tool for chemical kinetics called "KPPA" is developed. KPPA generates highly tuned C, Fortran, or Matlab code that uses every layer of heterogeneous parallelism in the CBEA, GPGPU, and homogeneous multi-core architectures. A scalable method for simulating chemical transport is also developed. The Weather Research and Forecasting Model with Chemistry (WRF-Chem) is accelerated with these methods with good results: real forecasts of air quality are generated for the Eastern United States 65% faster than the state-of-the-art models.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
30

Aryan, Omid. "A hardware-defined approach to software-defined radios : improving performance without trading In flexibility." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85402.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 93-94).
The thesis presents an implementation of a general DSP framework on the Texas Instruments OMAP-L138 processor. Today's software-defined radios suffer from fundamental drawbacks that inhibit their use in practical settings. These drawbacks include their large sizes, their dependence on a PC for digital signal processing operations, and their inability to process signals in real-time. Furthermore, FPGA-based implementations that achieve higher performances lack the flexibility that software implementations provide. The present implementation endeavors to overcome these issues by utilizing a processor that is low-power, small in size, and that provides a library of assembly-level optimized functions in order to achieve much faster performance with a software implementation. The evaluations show substantial improvements in performance when the DSP framework is implemented with the OMAP-L138 processor compared to that achieved with other software implemented radios.
by Omid Aryan.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
31

Paolillo, Antonio. "Optimisation of Performance Metrics of Embedded Hard Real-Time Systems using Software/Hardware Parallelism." Doctoral thesis, Universite Libre de Bruxelles, 2018. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/277427.

Full text
Abstract:
Optimisation of Performance Metrics of Embedded Hard Real-Time Systems using Software/Hardware Parallelism. Nowadays, embedded systems are part of our daily lives. Some of these systems are called safety-critical and have strong requirements in terms of safety and reliability. Additionally, these systems must have a long autonomy, good performance and minimal costs. Finally, these systems must exhibit predictable behaviour and provide their results within firm deadlines. When these different constraints are combined in the requirement specifications of a modern product, classic design techniques making use of single-core platforms are not sufficient. Academic research in the field of real-time embedded systems has produced numerous techniques to exploit the capabilities of modern hardware platforms. These techniques are often based on using parallelism inherently present in modern hardware to improve the system performance while reducing the platform power dissipation. However, very few systems existing on the market use these state-of-the-art techniques. Moreover, few of these techniques have been validated in the context of practical experiments. In this thesis, we study operating-system-level techniques that exploit hardware parallelism through the implementation of parallel software, in order to boost the performance of target applications and to reduce the overall system energy consumption while satisfying strict application timing requirements. We detail the theoretical foundations of the ideas applied in the dissertation and validate these ideas through experimental work. To this aim, we use a new Real-Time Operating System kernel written in the context of the creation of a spin-off of the Université libre de Bruxelles. Our experiments are based on the execution of applications on the operating system, which runs on a real-world platform for embedded systems. Our results show that, compared to traditional design techniques, using parallel and power-aware scheduling techniques to exploit hardware and software parallelism allows embedded applications to execute with substantial savings in terms of energy consumption. We present future and ongoing research work that exploits the capabilities of recent embedded platforms. These platforms combine multi-core processors and reconfigurable hardware logic, allowing further improvements in performance and energy consumption.
Optimisation de Métriques de Performances de Systèmes Embarqués Temps Réel Durs par utilisation du Parallélisme Logiciel et Matériel. De nos jours, les systèmes embarqués font partie intégrante de notre quotidien. Certains de ces systèmes, appelés systèmes critiques, sont soumis à de fortes contraintes de fiabilité et de robustesse. De plus, des contraintes de coûts, d'autonomie et de performances s'additionnent à la fiabilité. Enfin, ces systèmes doivent très souvent respecter des délais très stricts de façon prédictible. Lorsque ces différentes contraintes sont combinées dans le cahier de charge d'un produit, les techniques classiques de conception consistant à utiliser un seul cœur d'un processeur ne suffisent plus. La recherche académique dans le domaine des systèmes embarqués temps réel a produit de nombreuses techniques pour exploiter les plate-formes modernes. Ces techniques sont souvent basées sur l'exploitation du parallélisme inhérent au matériel pour améliorer les performances du système et la puissance dissipée par la plate-forme. Cependant, peu de systèmes existant sur le marché exploitent ces techniques de la littérature et peu de ces techniques ont été validées dans le cadre d'expériences pratiques. Dans cette thèse, nous réalisons l'étude des techniques, au niveau du système d'exploitation, permettant l'exploitation du parallélisme matériel par l'implémentation de logiciels parallèles afin de maximiser les performances et réduire l'impact sur l'énergie consommée tout en satisfaisant les contraintes temporelles strictes du cahier de charge applicatif. Nous détaillons les fondements théoriques des idées qui sont appliquées dans la dissertation et nous les validons par des travaux expérimentaux. À ces fins, nous utilisons le nouveau noyau d'un système d'exploitation écrit dans le cadre de la création d'une spin-off de l'Université libre de Bruxelles. Nos expériences, basées sur l'exécution d'applications sur le système d'exploitation qui s'exécute lui-même sur une plate-forme embarquée réelle, montrent que l'utilisation de techniques d'ordonnancement exploitant le parallélisme matériel et logiciel permet de larges économies d'énergie consommée lors de l'exécution d'applications embarquées. De futurs travaux en cours de réalisation sont présentés. Ceux-ci exploitent des plate-formes innovantes qui combinent processeurs multi-cœurs et matériel reconfigurable, permettant d'aller encore plus loin dans l'amélioration des performances et les gains énergétiques.
Doctorat en Sciences
info:eu-repo/semantics/nonPublished
APA, Harvard, Vancouver, ISO, and other styles
32

Kong, Martin Richard. "Enabling Task Parallelism on Hardware/Software Layers using the Polyhedral Model." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1452252422.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Cornevaux-Juignet, Franck. "Hardware and software co-design toward flexible terabits per second traffic processing." Thesis, Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2018. http://www.theses.fr/2018IMTA0081/document.

Full text
Abstract:
La fiabilité et la sécurité des réseaux de communication nécessitent des composants efficaces pour analyser finement le trafic de données. La diversification des services ainsi que l'augmentation des débits obligent les systèmes d'analyse à être plus performants pour gérer des débits de plusieurs centaines, voire milliers de Gigabits par seconde. Les solutions logicielles communément utilisées offrent une flexibilité et une accessibilité bienvenues pour les opérateurs du réseau mais ne suffisent plus pour répondre à ces fortes contraintes dans de nombreux cas critiques.Cette thèse étudie des solutions architecturales reposant sur des puces programmables de type Field-Programmable Gate Array (FPGA) qui allient puissance de calcul et flexibilité de traitement. Des cartes équipées de telles puces sont intégrées dans un flot de traitement commun logiciel/matériel afin de compenser les lacunes de chaque élément. Les composants du réseau développés avec cette approche innovante garantissent un traitement exhaustif des paquets circulant sur les liens physiques tout en conservant la flexibilité des solutions logicielles conventionnelles, ce qui est unique dans l'état de l'art.Cette approche est validée par la conception et l'implémentation d'une architecture de traitement de paquets flexible sur FPGA. Celle-ci peut traiter n'importe quel type de paquet au coût d'un faible surplus de consommation de ressources. Elle est de plus complètement paramétrable à partir du logiciel. La solution proposée permet ainsi un usage transparent de la puissance d'un accélérateur matériel par un ingénieur réseau sans nécessiter de compétence préalable en conception de circuits numériques
The reliability and the security of communication networks require efficient components to finely analyze the traffic of data. Service diversification and throughput increase force network operators to constantly improve analysis systems in order to handle throughputs of hundreds, even thousands, of Gigabits per second. Commonly used solutions are software-oriented solutions that offer a flexibility and an accessibility welcome for network operators, but they can no longer answer these strong constraints in many critical cases. This thesis studies architectural solutions based on programmable chips like Field-Programmable Gate Arrays (FPGAs), combining computation power and processing flexibility. Boards equipped with such chips are integrated into a common software/hardware processing flow in order to balance the shortcomings of each element. Network components developed with this innovative approach ensure an exhaustive processing of packets transmitted on physical links while keeping the flexibility of usual software solutions, which was never encountered in the previous state of the art. This approach is validated by the design and the implementation of a flexible packet processing architecture on FPGA. It is able to process any packet type at the cost of a slight resource overconsumption. It is moreover fully customizable from the software part. With the proposed solution, network engineers can transparently use the processing power of a hardware accelerator without the need of prior knowledge in digital circuit design
APA, Harvard, Vancouver, ISO, and other styles
34

Quintal, Luis Fernando Curi. "The applicability of hardware design strategies to improve software application performance in multi-core architectures." Thesis, University of Reading, 2014. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.628531.

Full text
Abstract:
Multi-core architectures have become the main trend in the last decade in microprocessor design in order to deliver increasing computing power in computing systems. This trend in microprocessor architecture is leading towards the increment of the number of processing cores in the same die, thanks to transistor miniaturisation, with tens to hundreds of cores integrated in a system in the near future. Moreover, multi-core architectures have become pervasive in computing devices at all levels, from smartphones and tablets to multi-user servers and supercomputers. A multi-core architecture poses a challenge in software application development, since sequential applications can no longer benefit from new multi-core generations. Alternative programming models, tools and algorithms are needed in order to efficiently exploit computing power inherent in multi-core architectures. This thesis presents map-merge, a parallel processing method to formulate parallel versions of software applications that classify and process large input data arrays. The use and advantages of map-merge are illustrated with a parallel formulation of a generic bucket sort algorithm, which gives a peak speedup gain of 9 when it is executed in a dual 6-core system; and also with a parallel formulation of a branch predictor simulator, which gives a peak speedup gain of 7 in the same multi-core system. In addition, two methods, based on bit-representation and bit-manipulation strategies, are presented: bit-slice as an alternative algorithm to rank elements in a set, and bit-index as an alternative algorithm to sort a set of integers. The bit-slice method is implemented as an application to compute the median from a set of integers. This implementation outperformed median calculation versions by up to 6 times. These versions are based on two sorting algorithms: quicksort and counting sort. On the other hand, the bit-index method is implemented as a sorting algorithm for permutations of a set of integers.
The approach also outperformed the quicksort and counting sort implementations, with peak speedup gains of 10 and 6 respectively. These methods are inspired by traditional hardware design strategies.
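The exact bit-index formulation is not given in the abstract; a minimal sketch of the underlying bit-manipulation idea, sorting a permutation of distinct small non-negative integers by setting one bit per value and then scanning the set bits from low to high, could look like this (illustrative only, not the thesis's implementation):

```python
def bit_index_sort(values):
    """Sort distinct non-negative integers by marking bit v for each
    value v in one big integer, then emitting set bits low-to-high.
    Building the mask costs one OR per element; Python's arbitrary-
    precision integers make the 'word' as wide as needed."""
    mask = 0
    for v in values:
        mask |= 1 << v
    out = []
    while mask:
        low = mask & -mask            # isolate the lowest set bit
        out.append(low.bit_length() - 1)
        mask ^= low                   # clear that bit
    return out
```

In hardware, the same idea maps to a fixed-width bit vector and a priority encoder, which is presumably what makes it attractive as a hardware-inspired software strategy.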
APA, Harvard, Vancouver, ISO, and other styles
35

Suljevic, Benjamin. "Mapping HW resource usage towards SW performance." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-44176.

Full text
Abstract:
With software applications increasing in complexity, the description of hardware is becoming increasingly relevant. To ensure the quality of service for specific applications, it is imperative to have an insight into hardware resources. Cache memory stores data needed for quick access closer to the processor and improves the quality of service of applications. The description of cache memory usually consists of the size of the different cache levels, set associativity, or line size. Software applications would benefit from a more detailed model of cache memory. In this thesis, we offer a way of describing the behavior of cache memory which benefits software performance. Several performance events are tested, including L1 cache misses, L2 cache misses, and L3 cache misses. With the collected information, we develop performance models of cache memory behavior. Goodness of fit is tested for these models, and they are used to predict the behavior of the cache memory during future runs of the same application. Our experiments show that L1 cache misses can be modeled to predict future runs. The L2 cache miss model is less accurate but still usable for predictions, and the L3 cache miss model is the least accurate and is not feasible for predicting the behavior of future runs.
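The abstract does not give the form of the fitted models; as an illustrative sketch only, assuming a simple linear trend in per-interval miss counts (an assumption, not the thesis's stated model), one run's samples can be fitted with ordinary least squares and extrapolated to the next run:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, stdlib only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Hypothetical L1-miss counts sampled over four intervals of one run:
model = fit_line([0, 1, 2, 3], [1000, 1200, 1400, 1600])
```

Goodness of fit (for instance R squared on a held-out run) would then decide, as in the thesis, whether the model is usable for prediction at each cache level.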
APA, Harvard, Vancouver, ISO, and other styles
36

Hill, Terrance, Mark Geoghegan, and Kevin Hutzel. "IMPLEMENTATION AND PERFORMANCE OF A HIGHSPEED, VHDL-BASED, MULTI-MODE ARTM DEMODULATOR." International Foundation for Telemetering, 2002. http://hdl.handle.net/10150/606325.

Full text
Abstract:
International Telemetering Conference Proceedings / October 21, 2002 / Town & Country Hotel and Conference Center, San Diego, California
Legacy telemetry systems, although widely deployed, are being severely taxed to support the high data rate requirements of advanced aircraft and missile platforms. Increasing data rates, in conjunction with loss of spectrum have created a need to use available spectrum more efficiently. In response to this, new modulation techniques have been developed which offer more data capacity in the same operating bandwidth. Demodulation of these new waveforms is a computationally challenging task, especially at high data rates. This paper describes the design, implementation and performance of a high-speed, multi-mode demodulator for the Advanced Range Telemetry (ARTM) program which meets these challenges.
APA, Harvard, Vancouver, ISO, and other styles
37

Damasceno, Costa Diego Elias [Verfasser], and Artur [Akademischer Betreuer] Andrzejak. "Benchmark-driven Software Performance Optimization / Diego Elias Damasceno Costa ; Betreuer: Artur Andrzejak." Heidelberg : Universitätsbibliothek Heidelberg, 2019. http://d-nb.info/1192373170/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Damasceno, Costa Diego Elias [Verfasser], and Artur [Akademischer Betreuer] Andrzejak. "Benchmark-driven Software Performance Optimization / Diego Elias Damasceno Costa ; Betreuer: Artur Andrzejak." Heidelberg : Universitätsbibliothek Heidelberg, 2019. http://nbn-resolving.de/urn:nbn:de:bsz:16-heidok-269197.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Fong, Fredric, and Mustafa Raed. "Performance comparison of GraalVM, Oracle JDK andOpenJDK for optimization of test suite execution time." Thesis, Mittuniversitetet, Institutionen för data- och systemvetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-43169.

Full text
Abstract:
Testing, when done correctly, is an important part of software development since it is a measure of the quality of the software in question. Most highly rated software projects therefore have test suites implemented that include unit tests, integration tests, and other types of tests. However, a challenge regarding the test suite is that it needs to run each time new code changes are proposed. From the developer's perspective, it might not always be necessary to run the whole test suite for small code changes. Previous studies have tried to tackle this problem, e.g., by only running a subset of the test suite. This research investigates running the whole test suite of Java projects faster, by testing the Java Development Kits (JDKs) GraalVM Enterprise Edition (EE) and Community Edition (CE) against Oracle JDK and OpenJDK for Java 8 and 11. The research used the test suite execution time as a metric to compare the JDKs. Another metric that was considered was the test suite's number of test cases, used to try to find a breaking point for when GraalVM becomes beneficial. The tests were performed on two test machines, where the first used 20 out of 48 tested projects and the second used 11 out of 43 tested projects. When looking at the average of five runs, GraalVM EE 11 performed best in 11 out of 18 projects on the first test machine, compared to its closest competitor, and in 7 out of 11 projects on the second test machine, both for JDK 8 and 11. However, GraalVM EE 8 did not give any benefits on the first test machine compared to its competitors, which might indicate that the hardware plays a vital role in the performance of GraalVM EE 8. The number of test cases could not be used to determine a breaking point for when GraalVM is beneficial, but it was observed that GraalVM did not show any benefits for projects with an execution time of fewer than 39 seconds. It was also observed that GraalVM CE did not perform well compared to the other JDKs and was not competitive in any case.
APA, Harvard, Vancouver, ISO, and other styles
40

Hou, Wei. "Integrated Reliability and Availability Aanalysis of Networks With Software Failures and Hardware Failures." Scholar Commons, 2003. https://scholarcommons.usf.edu/etd/1393.

Full text
Abstract:
This dissertation research explores efficient algorithms and engineering methodologies for analyzing the overall reliability and availability of networks, integrating software failures and hardware failures. Node failures, link failures, and software failures are concurrently and dynamically considered in networks with complex topologies. The MORIN (MOdeling Reliability for Integrated Networks) method is proposed and discussed as an approach for analyzing the reliability of integrated networks. A Simplified Availability Modeling Tool (SAMOT) is developed and introduced to evaluate and analyze the availability of networks consisting of software and hardware component systems with architectural redundancy. In this dissertation, relevant research efforts in analyzing network reliability and availability are reviewed and discussed, experimental results of the proposed MORIN methodology and the SAMOT application are provided, and recommendations for future research in network reliability are summarized as well.
APA, Harvard, Vancouver, ISO, and other styles
41

Hou, Wei. "Integrated reliability and availability analysis of networks with software failures and hardware failures." [Tampa, Fla.] : University of South Florida, 2003. http://purl.fcla.edu/fcla/etd/SFE0000173.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Weng, Lichen. "A Hardware and Software Integrated Approach for Adaptive Thread Management in Multicore Multithreaded Microprocessors." FIU Digital Commons, 2012. http://digitalcommons.fiu.edu/etd/653.

Full text
Abstract:
The Multicore Multithreaded Microprocessor maximizes parallelism on a chip for optimal system performance, such that its popularity is growing rapidly in high-performance computing. It increases the complexity of resource distribution on a chip by leading it in two directions: isolation and unification. On one hand, multiple cores are implemented to deliver the computation and memory accessing resources to more than one thread at the same time. Nevertheless, this limits the threads' access to resources in different cores, even if extensively demanded. On the other hand, simultaneous multithreaded architectures unify the domestic execution resources for concurrently running threads. In such an environment, threads are greatly affected by inter-thread interference. Moreover, the impacts of the complicated distribution are enlarged by variation in workload behaviors. As a result, the microprocessor requires an adaptive management scheme to schedule threads across different cores and coordinate them within cores. In this study, an adaptive thread management scheme was proposed, integrating both hardware and software approaches. The instruction fetch policy at the hardware level took responsibility by prioritizing domestic threads, while the Operating System scheduler at the software level was used to pair threads dynamically to multiple cores. The tie between them was the proposed online linear model, which was dynamically constructed for every thread based on data misses by the regression algorithm. Consequently, the hardware part of the proposed scheme proactively granted higher priority to the threads with fewer predicted long-latency loads, expecting they would better utilize the shared execution resources. Meanwhile, the software part was invoked by such a model upon significant changes in the execution phases and paired threads with different demands to the same core to minimize competition on the chip.
The proposed scheme was compared to its peer designs and overall 43% speedup was achieved by the integrated approach over the combination of two baseline policies in hardware and software, respectively. The overhead was examined carefully regarding power, area, storage and latency, as well as the relationship between the overhead and the performance.
APA, Harvard, Vancouver, ISO, and other styles
43

Lewerentz, Andreaz, and Jonathan Lindvall. "Performance and Energy Optimization for the Android Platform." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-4490.

Full text
Abstract:
Software developers are faced with several challenges when creating applications for the new generation of mobile devices. Smartphones and tablets have limited processing power and memory resources, and a small battery is the only thing that keeps the hardware components running. End users have little patience for slow applications that drain the batteries of their devices. To satisfy the needs of their customers, developers must take these hardware limitations into account; they must make an effort to optimize the performance and energy efficiency of their applications. This thesis provides a general overview of performance and energy optimization in the mobile domain. A specific sub-area is explored in great detail: the use of native C code for performance and energy optimization of Android applications. An experiment was conducted to see how the performance of native code compares to that of Java. This is the first time that such measurements have been made on both emulators and physical devices. The devices were running recent versions of Android that have not been used for similar experiments before: 2.3.3, 3.2 and 4.0.3. It is also the first time that native code has been compared to Java in terms of energy consumption. The results show that the latest updates to the Android platform have brought Java closer to native code in terms of performance, but native code is still the best choice for certain types of operations. It is also evident that there is a close correlation between performance and energy efficiency. Finally, the results show that Android emulators are unreliable for performance measurements. This could be a reason to question the validity of previous research.
APA, Harvard, Vancouver, ISO, and other styles
44

Chen, Kuan-Hsun [Verfasser], Jian-Jia [Akademischer Betreuer] Chen, and Rolf [Gutachter] Ernst. "Optimization and analysis for dependable application software on unreliable hardware platforms / Kuan-Hsun Chen ; Gutachter: Rolf Ernst ; Betreuer: Jian-Jia Chen." Dortmund : Universitätsbibliothek Dortmund, 2019. http://d-nb.info/1189420333/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Ledinov, Dmytro. "UpTime 4 - Health Monitoring Component." Thesis, Linnéuniversitetet, Institutionen för datavetenskap (DV), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-31343.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Sargur, Sudarshan Lakshminarasimhan. "An Efficient Architecture for Dynamic Profiling of Multicore Systems." Thesis, The University of Arizona, 2015. http://hdl.handle.net/10150/595814.

Full text
Abstract:
Application profiling is an important step in the design and optimization of embedded systems. Accurately identifying and analyzing the execution of frequently executed computational kernels is needed to effectively optimize the system implementation, both at design time and runtime. In a traditional design process, it suffices to perform the profiling and optimization steps offline, during design time. The offline profiling guides the design space exploration, hardware software codesign, or power and performance optimizations. When the system implementation can be finalized at design time, this approach works well. However, dynamic optimization techniques, which adapt and reconfigure the system at runtime, require dynamic profiling with minimum runtime overheads. Existing profiling methods are usually software based and incur significant overheads that may be prohibitive or impractical for profiling embedded systems at runtime. In addition, these profiling methods typically focus on profiling the execution of specific tasks executing on a single processor core, but do not consider accurate and holistic profiling across multiple processor cores. Directly utilizing existing profiling approaches and naively combining isolated profiles from multiple processor cores can lead to significant profile inaccuracies of up to 35%. To address these challenges, a hardware-based dynamic application profiler for non-intrusively and accurately profiling software applications in multicore embedded systems is presented. The profiler provides a detailed execution profile for computational kernels and maintains profile accuracy across multiple processor cores. The hardware-based profiler achieves an average error of less than 0.5% for the percentage execution time of profiled applications while being area efficient.
APA, Harvard, Vancouver, ISO, and other styles
47

Belsick, Charlotte Ann. "Space Vehicle Testing." DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/888.

Full text
Abstract:
Requirement verification and validation is a critical component of building and delivering space vehicles with testing as the preferred method. This Master’s Project presents the space vehicle test process from planning through test design and execution. It starts with an overview of the requirements, validation, and verification. The four different verification methods are explained including examples as to what can go wrong if the verification is done incorrectly. Since the focus of this project is on test, test verification is emphasized. The philosophy behind testing, including the “why” and the methods, is presented. The different levels of testing, the test objectives, and the typical tests are discussed in detail. Descriptions of the different types of tests are provided including configurations and test challenges. While most individuals focus on hardware only, software is an integral part of any space product. As such, software testing, including mistakes and examples, is also presented. Since testing is often not performed flawlessly the first time, sections on anomalies, including determining root cause, corrective action, and retest is included. A brief discussion of defect detection in test is presented. The project is actually presented in total in the Appendix as a Power Point document.
APA, Harvard, Vancouver, ISO, and other styles
48

Beaugnon, Ulysse. "Efficient code generation for hardware accelerators by refining partially specified implementation." Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE050.

Full text
Abstract:
Les compilateurs cherchant à améliorer l’efficacité des programmes doivent déterminer quelles optimisations seront les plus bénéfiques. Ce problème est complexe, surtout lors des premières étapes de la compilation où chaque décision influence les choix disponibles aux étapes suivantes. Nous proposons de représenter la compilation comme le raffinement progressif d’une implémentation partiellement spécifiée. Les décisions possibles sont toutes connues dès le départ et commutent. Cela permet de prendre les décisions les plus importantes en premier et de construire un modèle de performance capable d'anticiper les potentielles optimisations. Nous appliquons cette approche pour générer du code d'algèbre linéaire ciblant des GPU et obtenons des performances comparables aux bibliothèques optimisées à la main
Compilers looking for an efficient implementation of a function must find which optimizations are the most beneficial. This is a complex problem, especially in the early steps of the compilation process. Each decision may impact the transformations available in subsequent steps. We propose to represent the compilation process as the progressive refinement of a partially specified implementation. All potential decisions are exposed upfront and commute. This allows for making the most discriminative decisions first and for building a performance model aware of which optimizations may be applied in subsequent steps. We apply this approach to the generation of efficient GPU code for linear algebra and yield performance competitive with hand-tuned libraries
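The abstract's key idea, all decisions exposed upfront and commuting so the most discriminative can be fixed first, can be caricatured in a few lines. This toy sketch is not Beaugnon's actual candidate representation or performance model; the cost table and all names are hypothetical:

```python
def refine(open_decisions, estimate):
    """Refine a partially specified implementation: every decision and
    its candidate values are known upfront.  At each step, fix the most
    discriminative open decision (largest spread in the estimated cost
    of its candidates), keeping the candidate with the lowest estimate.
    'estimate(fixed, decision, value)' stands in for a performance
    model (lower is better)."""
    fixed = {}
    remaining = dict(open_decisions)
    while remaining:
        def spread(d):
            scores = [estimate(fixed, d, v) for v in remaining[d]]
            return max(scores) - min(scores)
        d = max(remaining, key=spread)
        fixed[d] = min(remaining[d], key=lambda v: estimate(fixed, d, v))
        del remaining[d]
    return fixed

# Hypothetical per-choice cost table standing in for the model:
costs = {("tile", 16): 3, ("tile", 32): 1, ("unroll", 1): 2, ("unroll", 4): 2}
plan = refine({"tile": [16, 32], "unroll": [1, 4]},
              lambda fixed, d, v: costs[(d, v)])
```

Because the decisions commute, the order in which they are fixed affects only how quickly the search narrows, not which implementations remain reachable, which is what lets the model anticipate optimizations still open in later steps.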
APA, Harvard, Vancouver, ISO, and other styles
49

Hayes, Brian C. "Performance oriented scheduling with power constraints." [Tampa, Fla.] : University of South Florida, 2005. http://purl.fcla.edu/fcla/etd/SFE0001073.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Varma, Krishnaraj M. "Fast Split Arithmetic Encoder Architectures and Perceptual Coding Methods for Enhanced JPEG2000 Performance." Diss., Virginia Tech, 2006. http://hdl.handle.net/10919/26519.

Full text
Abstract:
JPEG2000 is a wavelet transform based image compression and coding standard. It provides superior rate-distortion performance when compared to the previous JPEG standard. In addition, JPEG2000 provides four dimensions of scalability: distortion, resolution, spatial, and color. These superior features make JPEG2000 ideal for use in power- and bandwidth-limited mobile applications like urban search and rescue. Such applications require a fast, low-power JPEG2000 encoder to be embedded on the mobile agent. This embedded encoder needs to also provide superior subjective quality to low-bitrate images. This research addresses these two aspects of enhancing the performance of JPEG2000 encoders. The JPEG2000 standard includes a perceptual weighting method based on the contrast sensitivity function (CSF). Recent literature shows that perceptual methods based on subband standard deviation are also effective in image compression. This research presents two new perceptual weighting methods that combine information from both the human contrast sensitivity function as well as the standard deviation within a subband or code-block. These two new sets of perceptual weights are compared to the JPEG2000 CSF weights. The results indicate that our new weights performed better than the JPEG2000 CSF weights for high-frequency images. Weights based solely on subband standard deviation are shown to perform worse than JPEG2000 CSF weights for all images at all compression ratios. Embedded block coding, EBCOT tier-1, is the most computationally intensive part of the JPEG2000 image coding standard. Past research on fast EBCOT tier-1 hardware implementations has concentrated on cycle-efficient context formation. These pass-parallel architectures require that JPEG2000's three mode switches be turned on.
While turning on the mode switches allows for arithmetic encoding from each coding pass to run independent of each other (and thus in parallel), it also disrupts the probability estimation engine of the arithmetic encoder, thus sacrificing coding efficiency for improved throughput. In this research a new fast EBCOT tier-1 design is presented: it is called the Split Arithmetic Encoder (SAE) process. The proposed process exploits concurrency to obtain improved throughput while preserving coding efficiency. The SAE process is evaluated using three methods: clock cycle estimation, multithreaded software implementation, a field programmable gate array (FPGA) hardware implementation. All three methods achieve throughput improvement; the hardware implementation exhibits the largest speedup, as expected. A high speed, task-parallel, multithreaded, software architecture for EBCOT tier-1 based on the SAE process is proposed. SAE was implemented in software on two shared-memory architectures: a PC using hyperthreading and a multi-processor non-uniform memory access (NUMA) machine. The implementation adopts appropriate synchronization mechanisms that preserve the algorithm's causality constraints. Tests show that the new architecture is capable of improving throughput as much as 50% on the NUMA machine and as much as 19% on a PC with two virtual processing units. A high speed, multirate, FPGA implementation of the SAE process is also proposed. The mismatch between the rate of production of data by the context formation (CF) module and the rate of consumption of data by the arithmetic encoder (AE) module is studied in detail. Appropriate choices for FIFO sizes and FIFO write and read capabilities are made based on the statistics obtained from test runs of the algorithm. Using a fast CF module, this implementation was able to achieve as much as 120% improvement in throughput.
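The CF-to-AE rate mismatch that the abstract studies is a classic bounded producer-consumer problem. The sketch below models it with two threads and a fixed-depth FIFO; the FIFO depth and the stand-in symbols are assumptions for illustration, not values from the thesis.

```python
import queue
import threading

# Hypothetical FIFO depth; the thesis derives sizes from measured
# statistics of the CF and AE modules.
FIFO_DEPTH = 16

def context_formation(fifo, symbols):
    """Producer: emits (context, decision) stand-ins into the FIFO."""
    for s in symbols:
        fifo.put(s)      # blocks when the FIFO is full (back-pressure)
    fifo.put(None)       # sentinel marking the end of the code-block

def arithmetic_encoder(fifo, out):
    """Consumer: drains the FIFO, standing in for the MQ coder."""
    while True:
        s = fifo.get()
        if s is None:
            break
        out.append(s)

fifo = queue.Queue(maxsize=FIFO_DEPTH)
encoded = []
cf = threading.Thread(target=context_formation, args=(fifo, list(range(100))))
ae = threading.Thread(target=arithmetic_encoder, args=(fifo, encoded))
cf.start(); ae.start()
cf.join(); ae.join()
```

The bounded queue gives the same behavior as a hardware FIFO with full/empty handshaking: a fast CF module stalls only when the FIFO fills, so a deeper FIFO trades buffer area for fewer stalls.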
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles