Dissertations / Theses on the topic 'Coarse grained architecture'

Consult the top 46 dissertations / theses for your research on the topic 'Coarse grained architecture.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Guo, Yuanqing. "Mapping applications to a coarse-grained reconfigurable architecture." Enschede : University of Twente [Host], 2006. http://doc.utwente.nl/57121.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Lee, Jong-Suk Mark. "FleXilicon: a New Coarse-grained Reconfigurable Architecture for Multimedia and Wireless Communications." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/77094.

Full text
Abstract:
High computing power and flexibility are important design factors for multimedia and wireless communication applications due to the demand for high quality services and the frequent evolution of standards. The ASIC (Application Specific Integrated Circuit) approach provides an area-efficient, high-performance solution, but is inflexible. In contrast, the general purpose processor approach is flexible, but often fails to provide sufficient computing power. Reconfigurable architectures, which have been introduced as a compromise between the two extreme solutions, have been applied successfully to multimedia and wireless communication applications. In this thesis, we investigated a new coarse-grained reconfigurable architecture called FleXilicon, which is designed to execute critical loops efficiently and is embedded in an SOC with a host processor. FleXilicon improves resource utilization and achieves a high degree of loop level parallelism (LLP). The proposed architecture aims to mitigate major shortcomings of existing architectures through three schemes: (i) wider memory bandwidth, (ii) a reconfigurable controller, and (iii) flexible word-length support. Increased memory bandwidth satisfies the memory access requirements of LLP execution. The new reconfigurable controller design minimizes reconfiguration overhead and improves area efficiency. Flexible word-length support improves LLP by increasing the number of processing elements that can execute in parallel. The simulation results indicate that FleXilicon reduces the number of clock cycles and increases the speed for all five applications simulated. The speedup ratios compared with conventional architectures are as large as two orders of magnitude for some applications. VLSI implementation of FleXilicon in a 65 nm CMOS process indicates that the proposed architecture can operate at frequencies up to 1 GHz with moderate silicon area.
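The flexible word-length scheme mentioned in this abstract rests on a general idea: several narrow operands can share one wide datapath so that more logical processing elements operate per cycle. The snippet below is only an illustrative software analogue (not FleXilicon's hardware), showing sub-word addition of four 8-bit lanes packed into one 32-bit word.

```python
def swar_add(packed_a, packed_b, lane_bits, lanes, word_bits=32):
    """Add `lanes` unsigned operands of `lane_bits` each, packed into one word.

    Carries are kept inside lane boundaries by masking the top bit of every
    lane, adding, then folding the top bits back in with XOR. This mimics how
    a wide ALU can be split into independent narrow lanes.
    """
    assert lane_bits * lanes <= word_bits
    top_bit = 1 << (lane_bits - 1)
    tops = 0
    for i in range(lanes):                      # mask of every lane's top bit
        tops |= top_bit << (i * lane_bits)
    low = (packed_a & ~tops) + (packed_b & ~tops)          # no inter-lane carry
    result = low ^ (packed_a & tops) ^ (packed_b & tops)   # restore top bits
    return result & ((1 << (lane_bits * lanes)) - 1)


def pack(values, lane_bits):
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << lane_bits) - 1)) << (i * lane_bits)
    return word


def unpack(word, lane_bits, lanes):
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]


if __name__ == "__main__":
    a = pack([10, 200, 30, 40], 8)
    b = pack([5, 100, 60, 70], 8)
    # Four 8-bit additions performed on one 32-bit word (modulo 256 per lane).
    print(unpack(swar_add(a, b, 8, 4), 8, 4))  # -> [15, 44, 90, 110]
```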
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
3

Malik, Omer. "Pragma-Based Approach For Mapping DSP Functions On A Coarse Grained Reconfigurable Architecture." Licentiate thesis, KTH, Elektronik och Inbyggda System, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166410.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Yang, Yu. "BENCHMARK OF TRIGGERED INSTRUCTION BASED COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR RADIO BASE STATION." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177446.

Full text
Abstract:
Spatially-programmed architectures such as FPGAs are among the most prevalent hardware in various application areas. However, FPGAs suffer from significant overheads in area, latency and power. Coarse-Grained Reconfigurable Architectures (CGRAs) are designed to compensate for these disadvantages of FPGAs. In this thesis, a novel Triggered Instruction based CGRA designed by Intel is evaluated. The benchmarking work in this thesis focuses on the signal processing domain. Three performance-limiting functions, Channel Estimation, Radix-2 FFT and Interleaving, are selected from the open-source LTE Uplink Receiver PHY Benchmark, and are implemented and analyzed on the Triggered Instruction Architecture (TIA). Throughput-area relationships and throughput/area-area relationships are summarized in curves using a resource estimation method. The benchmark results show that TIA offers good flexibility for temporal execution, spatial execution, and a mix of the two. Designs in TIA are scalable and adjustable according to different performance requirements. Moreover, based on the development work, this thesis discusses the development flow of TIA, various programming techniques, low-latency mapping solutions, code size comparison, the development environment, and the integration of a heterogeneous system with TIA.
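For readers unfamiliar with the triggered-instruction model referred to above, the following toy interpreter is a rough sketch of the idea only (it is not Intel's TIA): a PE holds a pool of guarded instructions instead of a program counter, and any instruction whose trigger over predicate state and channel occupancy is true may fire.

```python
from collections import deque

def run_triggered_pe(instructions, preds, in_chan, out_chan, max_steps=100):
    """Tiny triggered-instruction PE model (illustrative only).

    Each instruction is a (trigger, action) pair: `trigger(preds, in_chan)`
    says whether it may fire, `action(preds, in_chan, out_chan)` performs it.
    There is no program counter; every step the scheduler fires the first
    instruction whose trigger is true, or stalls if none are ready.
    """
    for _ in range(max_steps):
        for trigger, action in instructions:
            if trigger(preds, in_chan):
                action(preds, in_chan, out_chan)
                break
        else:
            break          # no instruction ready: the PE stalls


if __name__ == "__main__":
    in_chan, out_chan = deque([3, 5, 7]), deque()
    preds = {"have_x": False}
    latch = {"x": 0}       # a data register of the PE

    program = [
        # Fires when no operand is latched and input data is available.
        (lambda p, ic: not p["have_x"] and len(ic) > 0,
         lambda p, ic, oc: (latch.__setitem__("x", ic.popleft()),
                            p.__setitem__("have_x", True))),
        # Fires when an operand is latched; doubles it and sends it downstream.
        (lambda p, ic: p["have_x"],
         lambda p, ic, oc: (oc.append(2 * latch["x"]),
                            p.__setitem__("have_x", False))),
    ]
    run_triggered_pe(program, preds, in_chan, out_chan)
    print(list(out_chan))   # -> [6, 10, 14]
```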
APA, Harvard, Vancouver, ISO, and other styles
5

Zhao, Xin. "High efficiency coarse-grained customised dynamically reconfigurable architecture for digital image processing and compression technologies." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/6187.

Full text
Abstract:
Digital image processing and compression technologies have significant market potential, especially the JPEG2000 standard which offers outstanding codestream flexibility and high compression ratio. Strong demand for high performance digital image processing and compression system solutions is forcing designers to seek proper architectures that offer competitive advantages in terms of all performance metrics, such as speed and power. Traditional architectures such as ASIC, FPGA and DSPs have limitations in either low flexibility or high power consumption. On the other hand, through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable architectures are proving to be strong candidates for future high performance digital image processing and compression systems. This thesis investigates dynamically reconfigurable architectures and especially the newly emerging RICA paradigm. Case studies such as a Reed-Solomon decoder and a WiMAX OFDM timing synchronisation engine are implemented in order to explore the potential of RICA-based architectures and the possible optimisation approaches such as eliminating conditional branches, reducing memory accesses and constructing kernels. Based on investigations in this thesis, a novel customised dynamically reconfigurable architecture targeting digital image processing and compression applications is devised, which can be tailored to different applications. A demosaicing engine based on the Freeman algorithm is designed and implemented on the proposed architecture as the pre-processing module in a digital imaging system. An efficient data buffer rotating scheme is designed with the aim of reducing memory accesses. Meanwhile, an investigation into mapping the demosaicing engine onto a dual-core RICA platform is performed. After optimisation, the performance of the proposed engine is carefully evaluated and compared in terms of throughput and consumed computational resources. When targeting the JPEG2000 standard, the core tasks such as 2-D Discrete Wavelet Transform (DWT) and Embedded Block Coding with Optimal Truncation (EBCOT) are implemented and optimised on the proposed architecture. A novel 2-D DWT architecture based on vector operations associated with the RICA paradigm is developed, and the complete DWT application is highly optimised for both throughput and area. For the EBCOT implementation, a novel Partial Parallel Architecture (PPA) for the most computationally intensive module in EBCOT, termed Context Modeling (CM), is devised. Based on the algorithm evaluation, an ARM core is integrated into the proposed architecture for performance enhancement. A Ping-Pong memory switching mode with a carefully designed communication scheme between the RICA-based architecture and the ARM is proposed. Simulation results demonstrate that the proposed architecture for JPEG2000 offers a significant advantage in throughput.
APA, Harvard, Vancouver, ISO, and other styles
6

Kattah, Senira da Silva. "Controls on deposition and resulting stratal architecture of coarse-grained alluvial and near-shore facies associations /." Digital version accessible at:, 1999. http://wwwlib.umi.com/cr/utexas/main.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Saraswat, Rohit. "A Finite Domain Constraint Approach for Placement and Routing of Coarse-Grained Reconfigurable Architectures." DigitalCommons@USU, 2010. https://digitalcommons.usu.edu/etd/689.

Full text
Abstract:
Scheduling, placement, and routing are important steps in Very Large Scale Integration (VLSI) design. Researchers have developed numerous techniques to solve placement and routing problems. As the complexity of Application Specific Integrated Circuits (ASICs) increased over the past decades, so did the demand for improved place and route techniques. The primary objective of these place and route approaches has typically been wirelength minimization due to its impact on signal delay and design performance. With the advent of Field Programmable Gate Arrays (FPGAs), the same place and route techniques were applied to FPGA-based design. However, traditional place and route techniques may not work for Coarse-Grained Reconfigurable Architectures (CGRAs), which are reconfigurable devices offering wider path widths than FPGAs and more flexibility than ASICs, due to the differences in architecture and routing network. Further, the routing network of several types of CGRAs, including the Field Programmable Object Array (FPOA), has deterministic timing as compared to the routing fabric of most ASICs and FPGAs reported in the literature. This necessitates a fresh look at alternative approaches to place and route designs. This dissertation presents a finite domain constraint-based, delay-aware placement and routing methodology targeting an FPOA. The proposed methodology takes advantage of the deterministic routing network of CGRAs to perform a delay aware placement.
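As a rough illustration of the finite-domain-constraint view of placement described in this abstract (a sketch only; the dissertation targets an FPOA with its own constraint model and delay awareness), the snippet below places a small dataflow graph onto a 2x2 PE grid by backtracking search, with each operation's domain being the free cells and a constraint that connected operations occupy adjacent cells.

```python
from itertools import product

GRID = [(r, c) for r, c in product(range(2), range(2))]   # 2x2 array of PEs
OPS = ["load", "mul", "add", "store"]
EDGES = [("load", "mul"), ("mul", "add"), ("add", "store")]

def adjacent(p, q):
    """Manhattan distance 1: nearest-neighbour connection only."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1

def place(assignment=None):
    """Backtracking finite-domain search: variables are operations,
    domains are free grid cells, constraints are adjacency of connected ops."""
    assignment = assignment or {}
    if len(assignment) == len(OPS):
        return assignment
    op = OPS[len(assignment)]                      # next unassigned variable
    for cell in GRID:
        if cell in assignment.values():
            continue                               # all-different constraint
        assignment[op] = cell
        ok = all(adjacent(assignment[a], assignment[b])
                 for a, b in EDGES
                 if a in assignment and b in assignment)
        if ok:
            result = place(assignment)
            if result:
                return result
        del assignment[op]
    return None

if __name__ == "__main__":
    print(place())  # e.g. {'load': (0, 0), 'mul': (0, 1), 'add': (1, 1), 'store': (1, 0)}
```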
APA, Harvard, Vancouver, ISO, and other styles
8

Bozetti, Guilherme. "Stratigraphy and architecture of a coarse-grained deep-water system within the Cretaceous Cerro Toro Formation, Silla Syncline area, southern Chile." Thesis, University of Aberdeen, 2017. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?pid=235577.

Full text
Abstract:
The Upper Cretaceous Cerro Toro Formation, southern Chile, is characterised by thin-bedded turbidites that envelope a series of coarse-grained, confined slope complex systems, interpreted as part of the Lago Sofia Member. This deep-water slope system overlies basin floor sheets of the Punta Barrosa Formation, and is overlain by the sand-filled slope channels of the Tres Pasos Formation. Particularly distinctive beds, known as TEDs (transitional event deposits), are up to 40 m thick, laterally extensive, have prominent fluted bases, and have a vertical fabric starting with (1) a thin, inversely-graded, clast-supported base; then (2) a normally-graded and clast-supported interval; (3) an increasingly sand and clay matrix-supported conglomerate, with (4) a progressive upwards increase in matrix and normal grading, both in the floating gravel clast and matrix grain sizes, towards the top; and (5) a co-genetic sandstone on top. In the Cerro Toro Formation, these TEDs tend to occur as multiple beds in the initial phases of deposition of each channel complex system. The TEDs are highly aggradational, slightly more amalgamated in the channel-axis, and more layered towards the margins. The fabric of these spectacular event beds is described in some detail from measured sections, combined with petrographic analysis and high-resolution field mapping. The 4 km x 200 m channel systems are contained within topographically irregular bathymetric lows that formed sediment pathways, interpreted to be either the result of slope deformation, or contained by poorly preserved, tectonically disrupted or slumped external levees. Syn-sedimentary tectonism is interpreted to be responsible for sharp changes in the system's architecture from channels to ponds, marked by a sharp change in lithofacies from dominantly conglomerates to dominantly sandstones. A refined architectural analysis is proposed, focusing on the recurrent pattern of at least 5 cycles of conglomerate-filled channel systems – ponded sheet sandstones.
APA, Harvard, Vancouver, ISO, and other styles
9

Tuitt, Natasha R. T. "4D interpretation of texture and architecture of a coarse grained slope channel system using automated statistics from high resolution outcrop photography." Thesis, University of Aberdeen, 2014. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?pid=218284.

Full text
Abstract:
The building blocks of a sedimentary system are essential inputs into studies of reservoir character and comparisons with other sedimentary systems. Yet, our current knowledge of the building blocks of deep water slope channel deposits is still largely speculative. A quantitative approach has been utilised in order to analyse a host of lithological data and objectively identify these sedimentary components. The laterally-extensive and gently-dipping continental slope deposits of the San Fernando Channel System, Baja California, provide the required control on sub-seismic-scale temporal and lateral variations of lithofacies and 3D architecture. High resolution photo-panoramas (with better than 2mm accuracy) of the prominent conglomeratic component of the succession were collected from various key parts of the outcrop. Image analysis of segments extracted from the photo-panoramas generates key parameters for comparison of texture and fabric of conglomerates, such as clast to matrix ratio, major axis length and relative orientation. Statistical analysis of these data enabled the erection of an objective lithofacies scheme for the gravel fraction, the grouping of lithofacies into objectively-defined assemblages, and the establishment of models for the lateral and stratigraphic distributions of these assemblages. 12 lithofacies were objectively identified through hierarchical cluster analysis of 4 quantitative lithological parameters. Statistical analyses indicate significant differences in diversity in the lithofacies assemblages between the early and later parts (termed Stage 1 and Stage 2) of a channel complex set (sensu Sprague, et al., 2002), and to a lesser extent between marginal and axial parts of the system. These can be related to spatial differences and temporal changes in the nature of the turbidity currents flowing through the channel system. Gravelly units become more organised and less diverse with time in one CCS, and each successive CCS more organised at earlier stratigraphic levels than the next, except for the last CCS which is interpreted as influenced by a tectonic paroxysm. These seemingly autocyclic changes in organisation are interpreted as process-responses to changes in equilibrium profile as the nature of confinement changes with the infilling of an initial erosional confinement, to confinement by a master levee and gradual infilling through the evolution of each CCS.
APA, Harvard, Vancouver, ISO, and other styles
10

Das, Satyajit. "Architecture and Programming Model Support for Reconfigurable Accelerators in Multi-Core Embedded Systems." Thesis, Lorient, 2018. http://www.theses.fr/2018LORIS490/document.

Full text
Abstract:
Emerging trends in embedded systems and applications need high throughput and low power consumption. Due to the increasing demand for low-power computing and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient hardware accelerators. The main drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function, and increasing the number of accelerators in a system on chip (SoC) causes scalability issues. Programmable accelerators provide flexibility and solve the scalability issues. A Coarse-Grained Reconfigurable Array (CGRA) architecture, consisting of several processing elements with word-level granularity, is a promising choice for a programmable accelerator. Inspired by the promising characteristics of programmable accelerators, the potential of CGRAs in near-threshold computing platforms is studied and an end-to-end CGRA research framework is developed in this thesis. The major contributions of this framework are: CGRA design, implementation, integration in a computing system, and compilation for the CGRA. First, the design and implementation of a CGRA named Integrated Programmable Array (IPA) is presented. Next, the problem of mapping applications with control and data flow onto the CGRA is formulated. From this formulation, several efficient algorithms are developed using the internal resources of the CGRA, with a vision for low-power acceleration. The algorithms are integrated into an automated compilation flow. Finally, the IPA accelerator is integrated into PULP, a Parallel Ultra-Low-Power Processing-Platform, to explore heterogeneous computing.
APA, Harvard, Vancouver, ISO, and other styles
11

Peyret, Thomas. "Architecture matérielle et flot de programmation associé pour la conception de systèmes numériques tolérants aux fautes." Thesis, Lorient, 2014. http://www.theses.fr/2014LORIS348/document.

Full text
Abstract:
Whether in automotive systems subject to heat stress or in the aerospace and nuclear fields subjected to cosmic, neutron and gamma radiation, the environment can lead to the development of faults in electronic systems. These faults, which can be transient or permanent, will lead to erroneous results that are unacceptable in some application contexts. The use of so-called rad-hard components is sometimes compromised due to their high costs and supply problems associated with export rules. This thesis proposes a joint hardware and software approach, independent of integration technology, for using digital programmable devices in environments that generate faults. Our approach includes the definition of a Coarse-Grained Reconfigurable Architecture (CGRA) able to execute entire application code, but also all the hardware and software mechanisms to make it tolerant to transient and permanent faults. This is achieved by the combination of redundancy and dynamic reconfiguration of the CGRA, based on a library of configurations generated by a complete design flow. This implemented flow maps code represented as a Control and Data Flow Graph (CDFG) onto the CGRA architecture, directly obtaining a large number of different configurations, and allows the full potential of the architecture to be exploited. This work, which has been validated through experiments with applications in the field of signal and image processing, has been the subject of two publications in international conferences and of two patents.
APA, Harvard, Vancouver, ISO, and other styles
12

Zain-ul-Abdin. "Programming of coarse-grained reconfigurable architectures." Doctoral thesis, Örebro universitet, Akademin för naturvetenskap och teknik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-15246.

Full text
Abstract:
Coarse-grained reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention in order to meet not only the increased computational demands of high-performance embedded systems, but also to fulfill the need of adaptability to functional requirements of the application. This thesis focuses on the programming aspects of such coarse-grained reconfigurable computing devices, including the relevant computation models that are capable of exposing different kinds of parallelism inherent in the application and the ability of these models to capture the adaptability requirements of the application. The thesis suggests the occam-pi language for programming of a broad class of coarse-grained reconfigurable architectures as an intermediate language; we call it intermediate, since we believe that the application programming is best done in a high-level domain-specific language. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language, and backends were developed to target two different coarse-grained reconfigurable architectures: XPP and Ambric. The results on XPP reveal that the occam-pi based implementations produce comparable throughput to those of NML programs, while programming at a much higher level of abstraction than that of NML. Similarly, the two occam-pi implementations of autofocus criterion calculation targeted to the Ambric platform outperform the CPU implementation by factors of 11-23. Thus, the results of the implemented case-studies suggest that the occam-pi language based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits.
APA, Harvard, Vancouver, ISO, and other styles
13

Ul-Abdin, Zain. "Programming of Coarse-Grained Reconfigurable Architectures." Doctoral thesis, Högskolan i Halmstad, Centrum för forskning om inbyggda system (CERES), 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-15050.

Full text
Abstract:
Coarse-grained reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention in order to meet not only the increased computational demands of high-performance embedded systems, but also to fulfill the need of adaptability to functional requirements of the application. This thesis focuses on the programming aspects of such coarse-grained reconfigurable computing devices, including the relevant computation models that are capable of exposing different kinds of parallelism inherent in the application and the ability of these models to capture the adaptability requirements of the application. The thesis suggests the occam-pi language for programming of a broad class of coarse-grained reconfigurable architectures as an intermediate language; we call it intermediate, since we believe that the application programming is best done in a high-level domain-specific language. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language, and backends were developed to target two different coarse-grained reconfigurable architectures: XPP and Ambric. The results on XPP reveal that the occam-pi based implementations produce comparable throughput to those of NML programs, while programming at a much higher level of abstraction than that of NML. Similarly, the two occam-pi implementations of autofocus criterion calculation targeted to the Ambric platform outperform the CPU implementation by factors of 11-23. Thus, the results of the implemented case-studies suggest that the occam-pi language based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits.
APA, Harvard, Vancouver, ISO, and other styles
14

Matsa, M. Morris E. (Moshe Morris Emanuel). "Compiling for coarse-grain reconfigurable architectures." Thesis, Massachusetts Institute of Technology, 1997. http://hdl.handle.net/1721.1/43484.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997.
Includes bibliographical references (p. 221-224).
by M. Morris E. Matsa.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
15

Badawi, Mohammad. "Adaptive Coarse-grain Reconfigurable Protocol Processing Architecture." Doctoral thesis, KTH, Elektronik och Inbyggda System, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-194400.

Full text
Abstract:
Digital signal processors and their variants have provided significant benefits for the efficient implementation of the Physical Layer (PHY) of the Open Systems Interconnection (OSI) model's seven-layer protocol processing stack compared to general-purpose processors. Protocol processors promise to provide a similar advantage for implementing the higher layers in the OSI seven-layer model. This thesis addresses the problem of designing customizable coarse-grain reconfigurable protocol processing fabrics as a solution to achieving high performance and computational efficiency. A key requirement that this thesis addresses is the ability not only to adapt to varying applications and standards, and different modes in each standard, but also to time-varying load and performance demands while maintaining quality of service. This thesis presents a tile-based multicore protocol processing architecture that can be customized at design time to meet the requirements of the target application. The architecture can then be reconfigured at boot time and tuned to suit the desired use-case. This architecture includes a packet-oriented memory system that has deterministic access time and access energy costs, and hence can be accurately dimensioned to fulfill the requirements of the desired use-case. Moreover, to maintain quality of service as predicted, while minimizing the use of energy and resources, this architecture encompasses an elastic management scheme that controls run-time configuration to deploy processing resources based on use-case and traffic demands. To evaluate the architecture presented in this thesis, different case studies were conducted while quantitative and qualitative metrics were used for assessment. Energy-delay product, energy efficiency, area efficiency and throughput show the improvements that were achieved using the processing cores and the memory of the presented architecture, compared with other solutions. Furthermore, the results show the reduction in latency and power consumption required to evaluate controlling states when using the elastic management scheme. The elasticity of the scheme also resulted in reducing the total area required for the controllers that serve multiple processing cores in comparison with other designs. Finally, the results validate the ability of the presented architecture to support quality of service without wasting available energy during a real-life case study of a multi-participant Voice Over Internet Protocol (VOIP) call.
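The elastic management scheme described above can be pictured with a simple control loop that deploys or releases processing cores as traffic varies; the thresholds and names below are invented for illustration and do not come from the thesis.

```python
def elastic_core_controller(queue_depth, active_cores,
                            min_cores=1, max_cores=8,
                            high_mark=48, low_mark=8):
    """Return the new number of active cores for the next control interval.

    Scale out when the packet queue builds up (quality of service at risk),
    scale in when it drains (save energy). Thresholds are illustrative.
    """
    if queue_depth > high_mark and active_cores < max_cores:
        return active_cores + 1          # deploy one more core
    if queue_depth < low_mark and active_cores > min_cores:
        return active_cores - 1          # power-gate one core
    return active_cores                  # keep the current configuration


if __name__ == "__main__":
    cores, trace = 1, [5, 20, 60, 70, 55, 30, 6, 4]
    for depth in trace:                  # one decision per traffic sample
        cores = elastic_core_controller(depth, cores)
        print(f"queue={depth:3d}  active cores={cores}")
```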

APA, Harvard, Vancouver, ISO, and other styles
16

Bag, Zeki Ozan. "Energy-Aware Coarse Grained Reconfigurable Architectures Using Dynamically Reconfigurable Isolation Cells." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-108217.

Full text
Abstract:
This thesis presents a self-adaptive power management system to improve the energy efficiency of coarse-grained reconfigurable architectures (CGRAs). CGRAs can host multiple applications on a single platform. Moreover, a single application may have multiple versions with different degrees of parallelism (fully serial, partially serial, fully parallel, etc.). Selection of the optimum application version depends on runtime conditions such as resource availability on the platform. A traditional worst-case design that satisfies its specifications results in poor power efficiency. Existing solutions to this problem offer costly hardware, mainly employing dynamic voltage and frequency scaling (DVFS). We propose exploiting reconfiguration of the available resources on the CGRA. Our solution makes use of dynamically reconfigurable isolation cells (DRICs) instead of dedicated hardware. We also introduce autonomous parallelism, voltage and frequency selection (APVFS) to realize DVFS functionality and to select the optimum version. Three applications are used for simulations, namely matrix multiplication, a finite impulse response filter (FIR) and a fast Fourier transform (FFT). Results show that up to 72% power and 55% energy can be saved, respectively. Synthesis of the fabric shows a considerable reduction in area overheads compared to existing designs employing DVFS.
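A minimal sketch of what an APVFS-style decision could look like in software, assuming made-up application versions, voltage/frequency points and a quadratic-in-voltage energy proxy (none of these numbers are from the thesis): pick the version and V/f pair that meets the deadline with minimum energy given the free resources.

```python
def select_version_and_vf(versions, vf_points, deadline_ms, free_pes):
    """Pick (version, voltage, frequency) meeting the deadline at minimum energy.

    `versions`: name -> (PEs required, cycles to finish)
    `vf_points`: list of (voltage_V, frequency_MHz)
    Energy model ~ V^2 * cycles (illustrative CMOS dynamic-energy proxy).
    """
    best = None
    for name, (pes, cycles) in versions.items():
        if pes > free_pes:
            continue                                   # not enough resources now
        for volt, freq_mhz in vf_points:
            time_ms = cycles / (freq_mhz * 1e3)        # cycles / (cycles per ms)
            if time_ms > deadline_ms:
                continue                               # misses the deadline
            energy = volt ** 2 * cycles                # arbitrary units
            if best is None or energy < best[0]:
                best = (energy, name, volt, freq_mhz)
    return best


if __name__ == "__main__":
    versions = {"serial": (1, 4_000_000),
                "partial": (2, 2_200_000),
                "parallel": (4, 1_200_000)}
    vf_points = [(0.8, 100), (1.0, 200), (1.2, 400)]
    # The parallel version at a lower V/f point wins on energy.
    print(select_version_and_vf(versions, vf_points, deadline_ms=10, free_pes=4))
```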
APA, Harvard, Vancouver, ISO, and other styles
17

Azad, Payandeh Siavoosh. "CRASIC: Customisation of Coarse Grain Reconfigurable Architectures." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-104641.

Full text
Abstract:
The gap between CGRAs and ASIC designs is a major issue for all digital designers. The main objective of this thesis is to develop a method to customize the design according to the application in order to increase performance and decrease the area of the chip. Although there are commercial high-level synthesis tools which are able to synthesize an algorithmic-level description of an application to an ASIC, they are not able to generate reconfigurable hardware operating as a multi-mode ASIC. In this project a tool has been developed which sweeps the design space between the fully reconfigurable CGRA hardware template and an ASIC. This tool generates customized hardware based on the implemented applications and user specifications by eliminating unused/unwanted components of the design. FFT and CP algorithms are used in this research in order to have some solid results for the area and power consumption of the customized design. The results show up to a 44 percent reduction in the area of a fabric containing both CP and a 2048-point FFT, and a 20 percent reduction in power consumption for the 2048-point FFT on the very same fabric. Developing this approach on CGRAs can enable designers to obtain their ASIC designs simply by implementing an algorithm on the fabric and then receiving a customized fabric close to an ASIC after a few small automatic steps.
APA, Harvard, Vancouver, ISO, and other styles
18

Brant, Alexander Dunlop. "Coarse and fine grain programmable overlay architectures for FPGAs." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/43918.

Full text
Abstract:
Overlay architectures are programmable logic systems that are compiled on top of a traditional FPGA. These architectures give designers flexibility, and have a number of benefits, such as being designed or optimized for specific application domains, making it easier or more efficient to implement solutions, being independent of platform, allowing the ability to do partial reconfiguration regardless of the underlying architecture, and allowing compilation without using vendor tools, in some cases with fully open source tool chains. This thesis describes the implementation of two FPGA overlay architectures, ZUMA and CARBON. These overlay implementations include optimizations to reduce area and increase speed which may be applicable to many other FPGAs and also ASIC systems. ZUMA is a fine-grain overlay which resembles a modern commercial FPGA, and is compatible with the VTR open source compilation tools. The implementation includes a number of novel features tailored to efficient FPGA implementation, including the utilization of reprogrammable LUTRAMs, a novel two-stage local routing crossbar, and an area efficient configuration controller. CARBON is a coarse-grain, time-multiplexed architecture, that directly implements the coarse-grain portion of the MALIBU architecture. MALIBU is a hybrid fine-grain and coarse-grain FPGA architecture that can be built using the combination of both CARBON and ZUMA, but this thesis focuses on their individual implementations. Time-multiplexing in CARBON greatly reduces performance, so it is vital to be optimized for delay. To push the speed of CARBON beyond the normal bound predicted by static timing analysis tools, this thesis has applied the Razor dynamic timing error tolerance system inside CARBON. This can dynamically push the clock frequency yet maintain correct operation. This required developing an extension of the Razor system from its original 1D feed-forward pipeline to a 2D bidirectional pipeline.
APA, Harvard, Vancouver, ISO, and other styles
19

Han, Wei. "Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/3812.

Full text
Abstract:
Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options to configure the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile a design space exploration methodology is developed to search the design space for multi-core systems to find suitable solutions under certain system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with the operating system support.
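The profiling-driven partitioning step mentioned above can be approximated, at its simplest, by assigning the longest profiled tasks first to the least-loaded core; the task names and times below are hypothetical, and the sketch ignores the communication costs that a real mapping flow must model.

```python
import heapq

def lpt_partition(task_times, num_cores):
    """Longest-processing-time-first assignment of tasks to cores.

    `task_times`: task name -> profiled execution time.
    Returns (makespan, core -> list of tasks). A simple stand-in for a
    profiling-driven mapping step.
    """
    heap = [(0.0, core) for core in range(num_cores)]   # (load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(num_cores)}
    for task, t in sorted(task_times.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)                # least-loaded core
        assignment[core].append(task)
        heapq.heappush(heap, (load + t, core))
    makespan = max(load for load, _ in heap)
    return makespan, assignment


if __name__ == "__main__":
    wimax_tasks = {"fft": 40, "channel_est": 25, "demap": 15,
                   "deinterleave": 10, "decode": 55}     # profiled ms (hypothetical)
    print(lpt_partition(wimax_tasks, num_cores=2))
```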
APA, Harvard, Vancouver, ISO, and other styles
20

Malla, Tika Kumari. "Case Studies to Learn Human Mapping Strategies in a Variety of Coarse-Grained Reconfigurable Architectures." Thesis, University of North Texas, 2017. https://digital.library.unt.edu/ark:/67531/metadc984195/.

Full text
Abstract:
Computer hardware and algorithm design have seen significant progress over the years. It is also seen that there are several domains in which humans are more efficient than computers. For example, in image recognition, image tagging, and natural language understanding and processing, humans often find complicated algorithms quite easy to grasp. This thesis presents case studies to learn human mapping strategies for solving the mapping problem in the area of coarse-grained reconfigurable architectures (CGRAs). To achieve optimum performance and consume less energy in CGRAs, the place and route problem has always been a major concern. Making use of human characteristics, such as pattern recognition and experience, can be helpful in problems like this. Therefore, to conduct the case studies, a computer mapping game called UNTANGLED was analyzed as a medium to convey insights into human mapping strategies in a variety of architectures. The purpose of this research was to learn from humans so that we can come up with better algorithms that outperform the existing ones. We observed how human strategies vary as we present players with different architectures, different architectures with constraints, and different visualizations, as well as how the quality of solutions changes with experience. In this work, all the case studies obtained from exploiting human strategies provide useful feedback that can improve upon existing algorithms. These insights can be adapted to find the best architectural solution for a particular domain and to guide future research directions for mapping onto mesh- and stripe-based CGRAs.
APA, Harvard, Vancouver, ISO, and other styles
21

Balavendran, Joseph Rani Deepika. "Gamification to Solve a Mapping Problem in Electrical Engineering." Thesis, University of North Texas, 2020. https://digital.library.unt.edu/ark:/67531/metadc1703330/.

Full text
Abstract:
Coarse-Grained Reconfigurable Architectures (CGRAs) are promising for developing high-performance, low-power portable applications. In this research, we crowdsource a mapping problem using gamification to harness human intelligence. A scientific puzzle game, Untangled, was developed to solve a mapping problem by encapsulating architectural characteristics. The primary motive of this research is to draw insights from the mapping solutions of players who possess innate abilities like decision-making, creative problem-solving, recognizing patterns, and learning from experience. In this dissertation, an extensive analysis was conducted to investigate how players' computational skills help to solve an open-ended problem with different constraints. From this analysis, we discovered a few common strategies among players, and subsequently, a library of dictionaries containing identified patterns from players' solutions was developed. The findings help to propose a better version of the game that incorporates these techniques recognized from the experience of players. In the future, an updated version of the game may help low-performing players to provide better solutions for the mapping problem. Eventually, these solutions may help to develop efficient mapping algorithms. In addition, this research can be an exemplar for future researchers who want to crowdsource such electrical engineering problems, and this approach can also be applied to other domains.
APA, Harvard, Vancouver, ISO, and other styles
22

Ioannou, Nikolas. "Complementing user-level coarse-grain parallelism with implicit speculative parallelism." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/7900.

Full text
Abstract:
Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Parallel programming is, thus, here to stay and programmers have to endorse it if they are to exploit such systems for their applications. Programs using parallel programming primitives like PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good trade-off between programming effort versus performance gain. Some parallel applications show limited or no scaling beyond a number of cores. Given the abundant number of cores expected in future many-cores, several cores would remain idle in such cases while execution performance stagnates. This thesis proposes using cores that do not contribute to performance improvement for running implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocols that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. We show that complementing parallel programs with implicit speculative mechanisms offers significant performance improvements for a large and diverse set of parallel benchmarks. Implicit speculative parallelism frees the programmer from the additional effort to explicitly partition the work into finer and properly synchronized tasks. Our results show that, for a many-core comprising 128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance improves on top of the highest scalability point by 44% on average for the 4-core cluster and by 31% on average for the 2-core cluster. We also show that this approach often leads to better performance and energy efficiency compared to existing alternatives such as Core Fusion and Turbo Boost. Moreover, we present a dynamic mechanism to choose the number of explicit and implicit threads, which performs within 6% of the static oracle selection of threads. To improve energy efficiency processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. We evaluate the amenability of the proposed explicit plus implicit threads scheme to traditional power management techniques for multithreaded applications and identify room for improvement. We thus augment prior schemes and introduce a novel multithreaded power management scheme that accounts for implicit threads and aims to minimize the Energy Delay2 product (ED2). Our scheme comprises two components: a “local” component that tries to adapt to the different program phases on a per explicit thread basis, taking into account implicit thread behavior, and a “global” component that augments the local components with information regarding inter-thread synchronization. Experimental results show a reduction of ED2 of 8% compared to having no power management, with an average reduction in power of 15% that comes at a minimal loss of performance of less than 3% on average.
APA, Harvard, Vancouver, ISO, and other styles
23

Muir, Mark I. R. "Re-targetable tools and methodologies for the efficient deployment of high-level source code on coarse-grained dynamically reconfigurable architectures." Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/27072.

Full text
Abstract:
Reconfigurable computing traditionally consists of a data path machine (such as an FPGA) acting as a co-processor to a conventional microprocessor. This involves partitioning the application such that the data path intensive parts are implemented on the reconfigurable fabric, and the control flow intensive parts are implemented on the microprocessor. Often the two parts have to be written in different languages. New highly parallel data path architectures allow parallelism approaching that of FPGAs, but are able to be reconfigured very rapidly. As a result, it is possible to use these architectures to perform control flow in a manner similar to a microprocessor, and thus a complete program can be described from an unmodified high-level language (in particular C). This overcomes the historical instruction-level parallelism (ILP) wall. To make full use of the available parallelism, existing microprocessor tool flows are insufficient. Data path machines are typically programmed via HDL tools from the ASIC design world. This expresses algorithms at a lower level than the application algorithms are typically developed in. The work in this thesis builds upon earlier work to allow applications to be described from high-level languages, by employing low-level optimisations in the compiler back-end and working from the assembly, to maximise parallel efficiency. This consists of scheduling, where known techniques are used to pack instructions into basic blocks that map well to the reconfigurable core (optimising spatial efficiency); then automatic pipelining is applied to dramatically improve the achievable throughput (optimising temporal efficiency). Together these can be thought of as 'instruction-level parallelism done right'. Speed-ups of more than an order of magnitude were achieved, yielding throughputs of 180-380MPixels/s on typical image signal processing tasks, matching the performance of hard-wired ASICs. Furthermore, conventional software-based simulation technologies for data path machines are too slow for use in application verification. This thesis demonstrates how a high-speed software emulator can be created for self-controlled dynamically reconfigurable data path machines, using a static serialisation of the data paths in each configuration context. This yields run-time performance several orders of magnitude higher than existing techniques, making it suitable for use in feedback-directed optimisation.
APA, Harvard, Vancouver, ISO, and other styles
24

Tafesse, Solomon. "Physical properties of coarse particles in till coupled to bedrock composition based on new 3D image analysis method." Licentiate thesis, KTH, Land and Water Resources Engineering, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-11988.

Full text
Abstract:

The physical properties of the coarse fraction of the till (0.4 to 20 cm) and the surface boulders have been studied at two different sites in Sweden. The research work included: development of a new image analysis software for 3D size and shape measurements of particles; lithological analysis on multiple size fractions in till and magnetic susceptibility survey on coarse till clasts, surface boulders and local bedrock.

The new 3D image analysis method provides an enormous amount of size and shape data for each particle in the coarse fraction (2 to 20 cm) of the till. The method is suitable for field studies, is cost-effective, and the software is executable in Matlab. The field imaging method, together with the image analysis software, gives non-subjective results for the size and shape of coarse particles and makes it feasible and easy to study a representative sample size, which is one tonne for testing clasts of sizes up to 20 cm.

The lithological analysis of the till clasts has been carried out on six different size fractions of the till (0.4 to 20 cm); the results for the different samples from the two sites show that this method can potentially be used as a stratigraphic tool in areas where there are no unique indicator lithologies.

Magnetic susceptibility measurements have been made on the surface boulders, the 6-20 cm till fraction and on in situ bedrock outcrops near the study sites. The method has good potential for determining stratigraphic relationships between different till units as well as for determining the provenance of coarse clasts and surface boulders.

APA, Harvard, Vancouver, ISO, and other styles
25

Shelor, Charles F. "Dataflow Processing in Memory Achieves Significant Energy Efficiency." Thesis, University of North Texas, 2018. https://digital.library.unt.edu/ark:/67531/metadc1248478/.

Full text
Abstract:
The large difference between processor CPU cycle time and memory access time, often referred to as the memory wall, severely limits the performance of streaming applications. Some data centers have shown servers being idle three out of four clock cycles. High-performance instruction-sequenced systems are not energy efficient. The execute stage of even simple pipeline processors uses only 9% of the pipeline's total energy. A hybrid dataflow system within a memory module is shown to have 7.2 times the performance with 368 times better energy efficiency than an Intel Xeon server processor on the analyzed benchmarks. The dataflow implementation exploits the inherent parallelism and pipelining of the application to improve performance without the overhead functions of caching, instruction fetch, instruction decode, instruction scheduling, reorder buffers, and speculative execution used by high-performance out-of-order processors. Coarse-grain reconfigurable logic in an energy-efficient silicon process provides the flexibility to implement multiple algorithms in a low-energy solution. Integrating the logic within a 3D stacked memory module provides lower latency and higher bandwidth access to memory while operating independently from the host system processor.
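The pipelined, instruction-free style of execution described above can be loosely pictured with a chain of streaming stages; the generator pipeline below is only a software analogy, not the hardware dataflow system of the thesis.

```python
def source(data):
    """Produce a stream of tokens, as a memory-side reader would."""
    for x in data:
        yield x

def scale(stream, k):
    """A dataflow node: fires whenever an input token is available."""
    for x in stream:
        yield k * x

def accumulate(stream):
    """Terminal node: running sum over the stream."""
    total = 0
    for x in stream:
        total += x
        yield total

if __name__ == "__main__":
    # Compose the pipeline; tokens flow element by element through all
    # stages rather than completing one stage over the whole array first.
    pipeline = accumulate(scale(source(range(1, 6)), k=3))
    print(list(pipeline))  # -> [3, 9, 18, 30, 45]
```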
APA, Harvard, Vancouver, ISO, and other styles
26

Kim, Yoonjin. "DESIGNING COST-EFFECTIVE COARSE-GRAINED RECONFIGURABLE ARCHITECTURE." 2009. http://hdl.handle.net/1969.1/ETD-TAMU-2009-05-649.

Full text
Abstract:
Application-specific optimization of embedded systems becomes inevitable to satisfy the market demand for designers to meet tighter constraints on cost, performance and power. On the other hand, the flexibility of a system is also important to accommodate the short time-to-market requirements for embedded systems. To reconcile these incompatible demands, coarse-grained reconfigurable architecture (CGRA) has emerged as a suitable solution. A typical CGRA requires many processing elements (PEs) and a configuration cache for reconfiguration of its PE array. However, such a structure consumes significant area and power. Therefore, designing cost-effective CGRAs has been a serious concern for the reliability of CGRA-based embedded systems. As an effort to provide such cost-effective design, the first half of this work focuses on reducing power in the configuration cache. For power saving in the configuration cache, a low power reconfiguration technique is presented based on reusable context pipelining achieved by merging the concept of context reuse into context pipelining. In addition, we propose dynamic context compression, which keeps only the required bits of the context words enabled and the redundant bits disabled. Finally, we provide dynamic context management capable of reducing power consumption in the configuration cache by controlling the read/write operations of the redundant context words. In the second part of this dissertation, we focus on designing a cost-effective PE array to reduce area and power. For area and power saving in a PE array, we devise a cost-effective array fabric that addresses a novel rearrangement of processing elements and their interconnection designs to reduce area and power consumption. In addition, hierarchical reconfigurable computing arrays are proposed, consisting of two reconfigurable computing blocks with two types of communication structure. The two computing blocks share critical resources, and such a sharing structure provides an efficient communication interface between them while reducing the overall area. Based on the proposed design approaches, a CGRA combining the multiple design schemes is shown to verify the synergy effect of the integrated approach. Experimental results show that the integrated approach reduces area by 23.07% and power by up to 72% when compared with the conventional CGRA.
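The dynamic context compression idea, storing only the bits of a configuration (context) word that are actually in use, can be sketched as follows; the field layout and widths are invented for illustration and are not the dissertation's format.

```python
# Illustrative context-word layout: (field name, bit width).
FIELDS = [("opcode", 4), ("src_a", 3), ("src_b", 3), ("dst", 3), ("route", 5)]

def compress(context):
    """Keep only fields present in `context`; prepend a presence bitmask.

    Returns (mask, packed_bits, bit_length). Absent fields cost zero bits,
    which is the spirit of storing only the required bits of a context word.
    """
    mask, packed, pos = 0, 0, 0
    for i, (name, width) in enumerate(FIELDS):
        if name in context:
            mask |= 1 << i
            packed |= (context[name] & ((1 << width) - 1)) << pos
            pos += width
    return mask, packed, pos

def decompress(mask, packed):
    """Rebuild the full context word; missing fields default to 0 (disabled)."""
    context, pos = {}, 0
    for i, (name, width) in enumerate(FIELDS):
        if mask & (1 << i):
            context[name] = (packed >> pos) & ((1 << width) - 1)
            pos += width
        else:
            context[name] = 0
    return context

if __name__ == "__main__":
    cfg = {"opcode": 0b1010, "dst": 0b101}          # only two fields in use
    mask, bits, nbits = compress(cfg)
    print(nbits, "bits instead of", sum(w for _, w in FIELDS))  # 7 instead of 18
    print(decompress(mask, bits))
```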
APA, Harvard, Vancouver, ISO, and other styles
27

"Compiler and Architecture Design for Coarse-Grained Programmable Accelerators." Doctoral diss., 2015. http://hdl.handle.net/2286/R.I.34909.

Full text
Abstract:
The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly with technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy efficiency must be addressed at all design levels, from the circuit level to the application and algorithm levels. At the architectural level, one promising approach is to populate the system with hardware accelerators, each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy efficiency and programmability. Due to the intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism. A Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consisting of a number of word-level functional units. Motivated by the promising characteristics of software programmable accelerators, the potential of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and the CGRA compiler. First, the design and implementation of a CGRA and its instruction set are presented. This design is then modeled in a cycle-accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute-intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize the CGRA's scarce resources to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRAs.
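The mapping problem formulated in this abstract, assigning dataflow-graph operations to a small set of PEs over time while respecting dependences and resource limits, can be illustrated with a simple greedy list scheduler; this is a minimal sketch under the assumption of an acyclic graph and unit-latency operations, not the dissertation's algorithms.

```python
def list_schedule(dfg, num_pes):
    """Greedy mapping of an acyclic dataflow graph onto `num_pes` PEs.

    `dfg`: op -> list of predecessor ops. Each op takes one cycle.
    Returns op -> (cycle, pe). An op is ready once all predecessors are done,
    and at most `num_pes` ops issue per cycle (the resource constraint).
    """
    done, schedule, cycle = {}, {}, 0
    while len(done) < len(dfg):
        ready = [op for op in dfg
                 if op not in done and all(p in done and done[p] < cycle
                                           for p in dfg[op])]
        for pe, op in enumerate(ready[:num_pes]):   # issue up to num_pes ops
            schedule[op] = (cycle, pe)
            done[op] = cycle
        cycle += 1
    return schedule


if __name__ == "__main__":
    # a, b are loads; c = a*b; d = a+b; e = c-d (an invented kernel).
    dfg = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"], "e": ["c", "d"]}
    for op, (cycle, pe) in sorted(list_schedule(dfg, num_pes=2).items(),
                                  key=lambda kv: kv[1]):
        print(f"cycle {cycle}: PE{pe} executes {op}")
```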
Dissertation/Thesis
Doctoral Dissertation Computer Science 2015
APA, Harvard, Vancouver, ISO, and other styles
28

Sousa, Diogo Alexandre Ribeiro de. "VLSI design of configurable low-power coarse-grained array architecture." Dissertation, 2017. https://repositorio-aberto.up.pt/handle/10216/105383.

Full text
Abstract:
Biomedical signal acquisition from in- or on-body sensors often requires local (on-node) low-level pre-processing before the data are sent to a remote node for aggregation and further processing. Local processing is required for many different operations, which include signal cleanup (noise removal), sensor calibration, event detection and data compression. In this environment, processing is subject to aggressive energy consumption restrictions, while often operating under real-time requirements. These conflicting requirements impose the use of dedicated circuits addressing a very specific task or the use of domain-specific customization to obtain significant gains in power efficiency. However, economic and time-to-market constraints often make the development or use of application-specific platforms very risky. One way to address these challenges is to develop a sensor node with a general-purpose architecture combining a low-power, low-performance general microprocessor or micro-controller with a coarse-grained reconfigurable array (CGRA) acting as an accelerator. A CGRA consists of a fixed number of processing units (e.g., ALUs) whose function and interconnections are determined by some configuration data. The objective of this work is to create an RTL-level description of a low-power CGRA of ALUs and produce a low-power VLSI (standard cell) implementation that supports power-saving features. The CGRA implementation should use as few resources as possible and fully exploit the intended operating environment. The design will be evaluated with a set of simple signal processing tasks.
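As a rough illustration of what "function and interconnections determined by configuration data" means in practice, the Python sketch below models a single ALU-style processing element whose operation and input sources are selected by a configuration word. The operation set and field names are assumptions made for the example, not the architecture developed in this dissertation.

```python
# Toy model of a configurable ALU-based processing element (PE).
# A configuration word selects the operation and which neighbours feed the inputs.
OPS = {
    0: lambda a, b: a + b,
    1: lambda a, b: a - b,
    2: lambda a, b: a * b,
    3: lambda a, b: a if a > b else b,   # max, e.g. for simple peak detection
}

def pe_step(config, neighbour_outputs):
    """One cycle of a PE: pick two inputs and an operation from the config."""
    op = OPS[config["opcode"]]
    a = neighbour_outputs[config["src_a"]]   # e.g. "north", "west", "self"
    b = neighbour_outputs[config["src_b"]]
    return op(a, b)

# Example: configure the PE as a multiply stage (scaling a sample by a gain).
config = {"opcode": 2, "src_a": "west", "src_b": "self"}
print(pe_step(config, {"west": 7, "north": 0, "self": 3}))   # -> 21
```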
APA, Harvard, Vancouver, ISO, and other styles
29

Lopes, João Pedro Sauvarin. "Configurable coarse-grained array architecture for processing of biological signals." Dissertação, 2017. https://repositorio-aberto.up.pt/handle/10216/105391.

Full text
Abstract:
With the emerging interest in low-power biological signal monitoring, the market for small, low-power, highly integrated solutions for signal acquisition and processing has grown substantially. This project is part of one such solution, a low-profile, wearable device capable of acquiring, analysing and transmitting data. Applications range from health and sports monitoring to clinical applications for short-term study or long-term monitoring. This kind of device is particularly interesting as it allows patients to be free and active while being monitored, improving the value of the results and potentially reducing the cost of health care. This dissertation is focused on studying CGRA (coarse-grained reconfigurable array) architectures for this application domain. By using a small array of interconnected processing elements one can leverage both parallelization and pipelining to efficiently process substantial amounts of data. Reaching an appropriate architecture requires the study of the trade-offs between functionality, versatility, throughput and energy consumption for this type of processing. The result of this analysis is an accelerator architecture to be inserted in a larger project named NanoStima.
APA, Harvard, Vancouver, ISO, and other styles
30

Kwok, Zion Siu-On. "Register file architecture optimization in a coarse-grained reconfigurable array." Thesis, 2005. http://hdl.handle.net/2429/16551.

Full text
Abstract:
This thesis investigates the impact of the global and local register file architecture on a reconfigurable system based on the ADRES architecture. The register files consume a significant amount of area on the reconfigurable device, and their architecture has a strong impact on the performance. We found that the global registers should be tightly connected to as many functional units as possible, while the connection of the local register files to their neighbours is less critical. We found that the global register file should contain 14 registers, while each local register file should only contain two registers. We used these results to propose two new architectures that demonstrate between -33% and 383% higher instructions per cycle per unit area compared to the original 4x4 and 8x8 array architectures, with 56% and 88% average improvement over a set of benchmarks for the new 4x4 and 8x8 array architectures, respectively.
Applied Science, Faculty of
Electrical and Computer Engineering, Department of
Graduate
APA, Harvard, Vancouver, ISO, and other styles
31

Varadarajan, Keshavan. "A Coarse Grained Reconfigurable Architecture Framework Supporting Macro-Dataflow Execution." Thesis, 2012. http://etd.iisc.ernet.in/handle/2005/2302.

Full text
Abstract:
A Coarse-Grained Reconfigurable Architecture (CGRA) is a processing platform constituted by an interconnection of coarse-grained computation units (viz. Function Units (FUs) or Arithmetic Logic Units (ALUs)). These units communicate directly through send-receive-like primitives, as opposed to the shared-memory based communication used in multi-core processors. CGRAs are a well-researched topic and the design space of a CGRA is quite large. The design space can be represented as a 7-tuple (C, N, T, P, O, M, H) where the terms have the following meaning: C - choice of computation unit, N - choice of interconnection network, T - choice of the number of context frames (single or multiple), P - presence of partial reconfiguration, O - choice of orchestration mechanism, M - design of the memory hierarchy, and H - host-CGRA coupling. In this thesis, we develop an architectural framework for a macro-dataflow based CGRA where we make the following choices for each of these parameters: C - ALU, N - Network-on-Chip (NoC), T - multiple contexts, P - support for partial reconfiguration, O - macro-dataflow based orchestration, M - data memory banks placed at the periphery of the reconfigurable fabric (the reconfigurable fabric is the name given to the interconnection of computation units), and H - loose coupling between the host processor and the CGRA, enabling our CGRA to execute an application independent of the host processor's intervention. The motivations for developing such a CGRA are: (i) to execute applications efficiently through a reduction in reconfiguration time (i.e. the time needed to transfer instructions and data to the reconfigurable fabric) and a reduction in execution time through better exploitation of all forms of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread/Task Level Parallelism (TLP); (ii) to permit customization of the CGRA for a particular domain through the use of domain-specific custom Intellectual Property (IP) blocks, which aids in improving application performance and energy efficiency; and (iii) to develop a CGRA which is completely programmable and accepts any program written using the C89 standard. We choose a macro-dataflow based orchestration framework in combination with partial reconfiguration so as to ease exploitation of TLP and DLP; macro-dataflow serves as a lightweight synchronization mechanism. We experiment with two variants of the macro-dataflow orchestration unit, namely a hardware-controlled orchestration unit and a compiler-controlled orchestration unit. We employ a NoC as it helps reduce the reconfiguration overhead. The compiler and the architecture were co-developed to ensure that every feature of the architecture could be automatically programmed through an application by the compiler. In this CGRA framework, the orchestration mechanism (O) and the host-CGRA coupling (H) are kept fixed and we permit design space exploration of the other terms in the 7-tuple design space. The mode of compilation and execution remains invariant under these changes, hence the term framework. We now elucidate the compilation and execution flow for this CGRA framework. An application written in the C language is compiled and transformed into a set of temporal partitions, referred to as HyperOps in this thesis. The macro-dataflow orchestration unit selects a HyperOp for execution when all its inputs are available. The instructions and operands for a ready HyperOp are transferred to the reconfigurable fabric for execution.
Each ALU (in the computation unit) is capable of waiting for the availability of its input data prior to issuing instructions. We permit the launch and execution of a temporal partition to progress in parallel, which reduces the reconfiguration overhead. We further cut launch delays by keeping loops persistent on the fabric, thus eliminating the need to relaunch their instructions. The CGRA framework has been implemented using Bluespec System Verilog. We evaluate the performance of two CGRA instances: one for cryptographic applications and another for linear algebra kernels. We also run other general-purpose integer and floating-point applications to demonstrate the generic nature of these optimizations. We explore various microarchitectural optimizations, viz. pipeline optimizations (i.e. changing the value of T), different forms of macro-dataflow orchestration such as the hardware-controlled orchestration unit and the compiler-controlled orchestration unit, and different execution modes including resident loops, pipeline parallelism, changes to the router, etc. As a result of these optimizations we observe a 2.5x improvement in performance compared to the base version. The reconfiguration overhead was hidden by overlapping the launching of instructions with execution. The perceived reconfiguration overhead is reduced drastically to about 9-11 cycles for each HyperOp, invariant of the size of the HyperOp. This can mainly be attributed to the data-dependent instruction execution and the use of the NoC. The overhead of the macro-dataflow execution unit was reduced to a minimum with the compiler-controlled orchestration unit. To benchmark the performance of these CGRA instances, we compare them with an Intel Core 2 Quad running at 2.66 GHz. On the cryptographic CGRA instance, running at 700 MHz, we observe one to two orders of magnitude improvement in performance for cryptographic applications, and up to one order of magnitude performance degradation for the linear algebra CGRA instance. The relatively poor performance of the linear algebra kernels can be attributed to the inability to exploit ILP across computation units interconnected by the NoC, the long latency in accessing data memory placed at the periphery of the reconfigurable fabric, and the unavailability of pipelined floating-point units (which are critical to the performance of linear algebra kernels). The superior performance of the cryptographic kernels can be attributed to a higher computation-to-load-instruction ratio, careful choice of custom IP blocks, the ability to construct large HyperOps, which allows a greater portion of the communication to be performed directly (as opposed to communication through a register file in a general-purpose processor), and the use of the resident-loops execution mode. The power consumption of a computation unit employed in the cryptography CGRA instance, along with its router, is about 76 mW, as estimated by Synopsys Design Vision using the Faraday 90 nm technology library for an activity factor of 0.5. The power of other instances would depend on the specific instantiation of the domain-specific units. This implies that for a reconfigurable fabric of size 5 x 6 the total power consumption is about 2.3 W. The area and power (about 84 mW) dissipated by the macro-dataflow orchestration unit, which is common to both instances, are comparable to those of a single computation unit, making it an effective and low-overhead mechanism for exploiting TLP.
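The firing rule just described, where a HyperOp is launched only once all of its inputs are available, can be captured in a few lines of Python. The sketch below is a behavioural model over invented data structures and HyperOp names; it is not the thesis's Bluespec hardware orchestrator or its compiler-controlled variant.

```python
from collections import deque

# Behavioural model of macro-dataflow orchestration: each HyperOp fires
# only when every one of its inputs has arrived.
class Orchestrator:
    def __init__(self, hyperops):
        # hyperops: {name: {"inputs": set of input names, "consumers": {name: token}}}
        self.hyperops = hyperops
        self.pending = {h: set(info["inputs"]) for h, info in hyperops.items()}
        self.ready = deque(h for h, need in self.pending.items() if not need)

    def deliver(self, hyperop, token):
        """An operand arrives; enqueue the HyperOp once its input set is complete."""
        need = self.pending[hyperop]
        need.discard(token)
        if not need:
            self.ready.append(hyperop)

    def run(self):
        order = []
        while self.ready:
            h = self.ready.popleft()          # "launch" onto the fabric
            order.append(h)
            for consumer, token in self.hyperops[h]["consumers"].items():
                self.deliver(consumer, token)
        return order

hops = {
    "load":   {"inputs": set(),                   "consumers": {"fir": "samples"}},
    "coeffs": {"inputs": set(),                   "consumers": {"fir": "taps"}},
    "fir":    {"inputs": {"samples", "taps"},     "consumers": {}},
}
print(Orchestrator(hops).run())   # ['load', 'coeffs', 'fir']
```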
APA, Harvard, Vancouver, ISO, and other styles
32

Sousa, Diogo Alexandre Ribeiro de. "VLSI design of configurable low-power coarse-grained array architecture." Master's thesis, 2017. https://repositorio-aberto.up.pt/handle/10216/105383.

Full text
Abstract:
Biomedical signal acquisition from in- or on-body sensors often requires local (on-node) low-level pre-processing before the data are sent to a remote node for aggregation and further processing. Local processing is required for many different operations, which include signal cleanup (noise removal), sensor calibration, event detection and data compression. In this environment, processing is subject to aggressive energy consumption restrictions, while often operating under real-time requirements. These conflicting requirements impose the use of dedicated circuits addressing a very specific task or the use of domain-specific customization to obtain significant gains in power efficiency. However, economic and time-to-market constraints often make the development or use of application-specific platforms very risky. One way to address these challenges is to develop a sensor node with a general-purpose architecture combining a low-power, low-performance general microprocessor or micro-controller with a coarse-grained reconfigurable array (CGRA) acting as an accelerator. A CGRA consists of a fixed number of processing units (e.g., ALUs) whose function and interconnections are determined by some configuration data. The objective of this work is to create an RTL-level description of a low-power CGRA of ALUs and produce a low-power VLSI (standard cell) implementation that supports power-saving features. The CGRA implementation should use as few resources as possible and fully exploit the intended operating environment. The design will be evaluated with a set of simple signal processing tasks.
APA, Harvard, Vancouver, ISO, and other styles
33

Lopes, João Pedro Sauvarin. "Configurable coarse-grained array architecture for processing of biological signals." Master's thesis, 2017. https://repositorio-aberto.up.pt/handle/10216/105391.

Full text
Abstract:
With the emerging interest in low-power biological signal monitoring, the market for small, low-power, highly integrated solutions for signal acquisition and processing has grown substantially. This project is part of one such solution, a low-profile, wearable device capable of acquiring, analysing and transmitting data. Applications range from health and sports monitoring to clinical applications for short-term study or long-term monitoring. This kind of device is particularly interesting as it allows patients to be free and active while being monitored, improving the value of the results and potentially reducing the cost of health care. This dissertation is focused on studying CGRA (coarse-grained reconfigurable array) architectures for this application domain. By using a small array of interconnected processing elements one can leverage both parallelization and pipelining to efficiently process substantial amounts of data. Reaching an appropriate architecture requires the study of the trade-offs between functionality, versatility, throughput and energy consumption for this type of processing. The result of this analysis is an accelerator architecture to be inserted in a larger project named NanoStima.
APA, Harvard, Vancouver, ISO, and other styles
34

Biswas, Prasenjit. "Hardware Consolidation Of Systolic Algorithms On A Coarse Grained Runtime Reconfigurable Architecture." Thesis, 2011. http://etd.iisc.ernet.in/handle/2005/2108.

Full text
Abstract:
Application domains such as Bio-informatics, DSP, Structural Biology, Fluid Dynamics, high-resolution direction finding, state estimation, and adaptive noise cancellation demand high-performance computing solutions for their simulation environments. The core computations of these applications are Numerical Linear Algebra (NLA) kernels. Direct solvers are predominantly required in domains like DSP and estimation algorithms such as the Kalman Filter, where the matrices on which operations need to be performed are either small or medium sized, but dense. Faddeev's Algorithm is often used for solving dense linear systems of equations. Modified Faddeev's Algorithm (MFA) is a general algorithm through which LU decomposition, QR factorization or SVD of matrices can be realized. MFA has the useful property of realizing a host of matrix operations by computing Schur complements on four blocked matrices, thereby reducing the overall computation requirements. We use MFA as a representative direct solver in this work. We further discuss the Givens-rotation-based QR algorithm for decomposition of any matrix, often used to solve the linear least squares problem. Systolic array architectures are widely accepted ASIC solutions for NLA algorithms, but the "can of worms" associated with this traditional solution spawns the need for alternative solutions. While popular custom hardware solutions in the form of systolic arrays can deliver high performance, because of their rigid structure they are neither scalable nor reconfigurable, and hence not commercially viable. We show how a Reconfigurable Computing platform can serve to contain this "can of worms". REDEFINE, a coarse-grained runtime reconfigurable architecture, has been used for the systolic actualization of NLA kernels. We elaborate upon streaming NLA-specific enhancements to REDEFINE in order to meet the expected performance goals. We explore the need for an algorithm-aware custom compilation framework. We present a proposal to realize Faddeev's Algorithm on REDEFINE. We show that REDEFINE performs several times faster than traditional GPPs. Further, we direct our interest to QR Decomposition as the next NLA kernel, as it ensures better stability than LU and other decompositions. We use QR Decomposition as a case study to explore the design space of the proposed solution on REDEFINE. We also investigate the architectural details of the Custom Functional Units (CFUs) for these NLA kernels. We determine the right size of the sub-array in accordance with the optimal pipeline depth of the core execution units and the number of such units to be used per sub-array. The framework used to realize QR Decomposition can be generalized for the realization of other algorithms dealing with decompositions like LU, Faddeev's Algorithm, Gauss-Jordan, etc. with different CFU definitions.
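As background on the kernel itself: the essence of Faddeev-style solvers is that eliminating the lower-left block of the compound matrix [[A, B], [-C, D]] with A as the pivot block leaves the Schur complement D + C*inv(A)*B in the lower-right block. The NumPy sketch below demonstrates only this mathematical identity; it says nothing about the systolic or REDEFINE realization studied in the thesis.

```python
import numpy as np

def faddeev_schur(A, B, C, D):
    """Return D + C @ inv(A) @ B by block elimination of [[A, B], [-C, D]].

    Annihilating the -C block with A as the pivot block updates the
    lower-right block in place to the Schur complement, which is the quantity
    a Faddeev-style array computes without ever forming inv(A) explicitly.
    """
    W = C @ np.linalg.solve(A, B)   # elimination multiplier applied to B
    return D + W

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # keep the pivot block well conditioned
B = rng.standard_normal((3, 2))
C = rng.standard_normal((2, 3))
D = rng.standard_normal((2, 2))
assert np.allclose(faddeev_schur(A, B, C, D), D + C @ np.linalg.inv(A) @ B)
```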
APA, Harvard, Vancouver, ISO, and other styles
35

Alle, Mythri. "Compiling For Coarse-Grained Reconfigurable Architectures Based On Dataflow Execution Paradigm." Thesis, 2012. http://etd.iisc.ernet.in/handle/2005/2453.

Full text
Abstract:
Coarse-Grained Reconfigurable Architectures (CGRAs) can be employed for accelerating computational workloads that demand both flexibility and performance. CGRAs comprise a set of computation elements interconnected using a network, and this interconnection of computation elements is referred to as a reconfigurable fabric. The size of the application that can be accommodated on the reconfigurable fabric is limited by the size of the instruction buffers associated with each compute element. When an application cannot be accommodated entirely, the application is partitioned such that each partition can be executed on the reconfigurable fabric. These partitions are scheduled by an orchestrator. The orchestrator employs the dynamic dataflow execution paradigm, which has inherent support for synchronization and helps in the exploitation of parallelism that exists across application partitions. In this thesis, we present a compiler that targets such CGRAs. The compiler is capable of accepting applications specified in the C89 standard. To enable architectural design space exploration, the compiler is designed such that it can be customized for several instances of CGRAs employing the dataflow execution paradigm at the orchestrator; this is achieved by specifying the appropriate configuration parameters to the compiler. The focus of this thesis is to provide efficient support for various kinds of parallelism while ensuring correctness. The compiler is designed to support fine-grained task-level parallelism that exists across iterations of loops and function calls. Additionally, the compiler can also support pipeline parallelism, where a loop is split into multiple stages that execute in a pipelined manner. The prototype compiler, which targets multiple instances of a CGRA, is demonstrated in this thesis. We used this compiler to target multiple variants of CGRAs employing the dataflow execution paradigm, varying the reconfigurable fabric, the orchestration mechanism employed, and the size of the instruction buffers. We also chose applications from two different domains, viz. cryptography and linear algebra. The execution time of the CGRA (the best among all instances) is compared against an Intel Quad core processor. Cryptography applications show a performance improvement ranging from more than one order of magnitude to close to two orders of magnitude. These applications have large amounts of ILP, and our compiler could successfully expose the ILP available in them. Further, the domain customization also played an important role in achieving good performance. We employed two custom functional units for accelerating cryptography applications, and the compiler could use them efficiently. In the linear algebra kernels we observe multiple iterations of the loop executing in parallel, effectively exploiting loop-level parallelism at runtime. In spite of this we notice close to an order of magnitude performance degradation, which can be attributed to the use of non-pipelined floating-point units and the delays involved in accessing memory. Pipeline parallelism was demonstrated using this compiler for FFT and QR factorization. Thus, the compiler is capable of efficiently supporting different kinds of parallelism and supports the complete C89 standard. Further, the compiler can also support different instances of CGRAs employing the dataflow execution paradigm.
APA, Harvard, Vancouver, ISO, and other styles
36

"Scalable Register File Architecture for CGRA Accelerators." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.40738.

Full text
Abstract:
Coarse-Grained Reconfigurable Arrays (CGRAs) are promising accelerators capable of accelerating even non-parallel loops and loops with low trip-counts. One challenge in compiling for CGRAs is to manage both recurring and nonrecurring variables in the register file (RF) of the CGRA. Although prior works have managed recurring variables via a rotating RF, they access the nonrecurring variables either through a global RF or from a constant memory. The former does not scale well, and the latter degrades the mapping quality. This work proposes a hardware-software codesign approach to manage all the variables in a local, nonrotating RF. The hardware provides a modulo-addition based indexing mechanism to enable correct addressing of recurring variables in a nonrotating RF. The compiler determines the number of registers required for each recurring variable and configures the boundary between the registers used for recurring and nonrecurring variables. The compiler also pre-loads the read-only variables and constants into the local registers in the prologue of the schedule. Synthesis and place-and-route results of the previous and the proposed RF designs show that the proposed solution achieves 17% better cycle time. Experiments mapping several important and performance-critical loops collected from MiBench show that the proposed approach improves performance (through better mapping) by 18%, compared to using constant memory.
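The modulo-addition indexing idea can be illustrated in a few lines. The register counts and the boundary value below are invented, and the model deliberately simplifies to a single rotating region rather than a per-variable register window, so this is only a sketch of the addressing principle, not the hardware described in the thesis.

```python
# Illustrative model of a local RF logically split by a boundary register:
# indices below `boundary` rotate with the iteration count, the rest do not.
def physical_index(logical, iteration, boundary):
    if logical < boundary:                       # recurring variable
        return (logical + iteration) % boundary  # modulo-addition indexing
    return logical                               # nonrecurring: fixed slot

BOUNDARY = 4      # assume 4 rotating registers followed by fixed registers
for it in range(3):
    # r1 written in iteration `it` lands in a different physical slot each
    # iteration, so values produced by earlier iterations are preserved.
    print("iter", it, "-> r1 maps to R", physical_index(1, it, BOUNDARY))
print("pointer held in r6 always maps to R", physical_index(6, 2, BOUNDARY))
```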
Dissertation/Thesis
Masters Thesis Computer Science 2016
APA, Harvard, Vancouver, ISO, and other styles
37

Merchant, Farhad. "Algorithm-Architecture Co-Design for Dense Linear Algebra Computations." Thesis, 2015. http://etd.iisc.ernet.in/2005/3958.

Full text
Abstract:
Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing application, used, for example, in LU and QR factorizations. Unfortunately, modern off-the-shelf microprocessors fall significantly short of achieving the theoretical lower bound in CPI for high-performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelism and propose suitable algorithmic and architectural variations to significantly improve computation efficiency. There are two standard approaches for improving computation efficiency: first, to perform application-specific architecture customization, and second, to do algorithmic tuning. In the same manner, we first perform a graph-based analysis of selected DLA kernels. From the various forms of parallelism thus identified, we design a custom processing element for improving the CPI. The processing elements are used as building blocks for a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By performing detailed experiments on a synthesized CGRA implementation, we demonstrate that our proposed algorithmic and architectural variations are able to achieve lower CPI compared to off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations and report a higher energy-performance-area product. DLA computations are encountered in many engineering and scientific computing applications, ranging from Computational Fluid Dynamics (CFD) to Eigenvalue problems. Traditionally, these applications are written using highly tuned High Performance Computing (HPC) software packages like the Linear Algebra Package (LAPACK) and/or the Scalable Linear Algebra Package (ScaLAPACK). The basic building block of these packages is the Basic Linear Algebra Subprograms (BLAS); algorithms in LAPACK/ScaLAPACK are written in terms of BLAS to achieve high throughput. Despite extensive intellectual effort in the development and tuning of these packages, there still exists scope for further tuning. In this thesis, we revisit the most prominent and widely used compute-bound algorithms, such as General Matrix Multiplication (GMM), for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and exhibit higher ILP in these algorithms. We first accelerate the sequential performance of the algorithms in BLAS and LAPACK and then focus on their parallel realization. The major algorithmic contributions of this thesis are as follows: we present a graph-based analysis of GMM and discuss the different types of parallelism available in it; we present an analysis of Givens Rotation (GR) based QR factorization, improve GR and derive Column-wise GR (CGR), which can annihilate multiple elements of a column of a matrix simultaneously, and show that CGR requires fewer multiplications than GR; we generalize CGR further and derive Generalized GR (GGR), which can annihilate multiple elements of several columns of a matrix simultaneously, and show that the parallelism exhibited by GGR is much higher than that of GR and the Householder Transform (HT); and we extend the generalizations to Square-root-Free GR (also known as Fast Givens Rotation) and Square-root and Division-Free GR (SDFG), deriving Column-wise Fast Givens and Column-wise SDFG.
We also extend the generalization to complex matrices and derive the Complex Column-wise Givens Rotation. Coarse-Grained Reconfigurable Architectures (CGRAs) have gained popularity in the last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also support domain customization. REDEFINE is an array of Tiles, where each Tile consists of a Compute Element and a Router. The Routers are responsible for on-chip communication, while the Compute Elements in REDEFINE can be domain-customized to accelerate the applications pertaining to the domain of interest. In this thesis, we take the REDEFINE base architecture as a starting point and design a Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently. We perform several architectural enhancements in the PE to approach the lower bound of the CPI. For parallel realization of BLAS and LAPACK, we attach this PE to the Router of REDEFINE. We achieve better area and power performance compared to earlier customized architectures for DLA. The major architectural contributions of this thesis are as follows: we present the design of a PE for the acceleration of GMM, a Level-3 BLAS operation; we methodically enhance the PE with different features to improve the performance of GMM; for efficient realization of LAPACK, we use the PE that can efficiently execute GMM and show better performance; and for further acceleration of LU and QR factorizations in LAPACK, we identify macro operations encountered in these factorizations and realize them on a reconfigurable data-path, resulting in 25-30% lower run-time.
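For readers who want the baseline algorithm that the column-wise variants generalize, the sketch below is the classic Givens-rotation QR factorization in NumPy, annihilating one sub-diagonal entry at a time. It is textbook material only; the CGR, GGR and square-root-free variants derived in the thesis are not reproduced here.

```python
import numpy as np

def givens_qr(A):
    """Plain Givens-rotation QR: zero the sub-diagonal entries one at a time."""
    A = A.astype(float)
    m, n = A.shape
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):        # annihilate A[i, j]
            a, b = A[i - 1, j], A[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])  # 2x2 rotation acting on rows i-1, i
            A[[i - 1, i], :] = G @ A[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T   # accumulate Q so Q @ R = M
    return Q, A                               # A is now the upper-triangular R

M = np.random.default_rng(1).standard_normal((4, 3))
Q, R = givens_qr(M)
assert np.allclose(Q @ R, M) and np.allclose(np.tril(R, -1), 0, atol=1e-10)
```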
APA, Harvard, Vancouver, ISO, and other styles
38

Shehan, Basher [Verfasser]. "Dynamic coarse grained reconfigurable architectures / presented by Basher Shehan." 2010. http://d-nb.info/1010124390/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Jiang, Jun-Bin, and 江俊賓. "A Predicate-Aware Modulo Scheduling for Coarse Grained Reconfigurable Architectures." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/qrf68u.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Industrial Master Program on IC Design, College of Electrical and Computer Engineering
100
To balance efficiency and flexibility, a coarse-grained reconfigurable architecture (CGRA) is proposed, which exploits the parallelism of a program without compromising its flexibility. However, finding more operation parallelism is a complicated problem for compilation. Modulo scheduling is one of the most widely adopted operation scheduling techniques in recent years; it introduces more parallelism by overlapping the iterations of a loop. Although modulo scheduling parallelizes many operations, we still observe that hardware resource usage is constrained by the 37.8% of operations that are conditionally executed. In this research, we propose a predicate-aware modulo scheduling that may map two disjoint operations onto the same processing element to reduce the hardware resource requirements; the corresponding architecture is also proposed. In addition, a weighted-cost-value mapping decision selection heuristic is designed to improve execution performance for the reconfigurable architecture. Our experimental results indicate that the initiation interval of a loop in the selected benchmarks can be reduced by 12% to 25.2% compared with a related work, and that there is still an 18% reduction when compared with the related work equipped with more resources.
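To make the benefit concrete: in modulo scheduling, the resource-constrained lower bound on the initiation interval is ResMII = ceil(number of operations / number of processing elements), and pairing operations guarded by mutually exclusive predicates lets two operations share one slot. The numbers in the Python sketch below are invented for illustration and are not the benchmark figures reported in the thesis.

```python
import math

def res_mii(num_ops, num_pes):
    """Resource-constrained minimum initiation interval."""
    return math.ceil(num_ops / num_pes)

def res_mii_predicate_aware(num_ops, disjoint_pairs, num_pes):
    """Each pair of operations guarded by mutually exclusive predicates
    (if/else branches) can be mapped to the same PE slot."""
    return math.ceil((num_ops - disjoint_pairs) / num_pes)

ops, pes = 53, 16          # hypothetical loop body on a 4x4 CGRA
pairs = 10                 # hypothetical number of disjoint if/else operation pairs
print("baseline II >=", res_mii(ops, pes))                                # 4
print("predicate-aware II >=", res_mii_predicate_aware(ops, pairs, pes))  # 3
```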
APA, Harvard, Vancouver, ISO, and other styles
40

"Register File Organization for Coarse-Grained Reconfigurable Architectures: Compiler-Microarchitecture Perspective." Master's thesis, 2014. http://hdl.handle.net/2286/R.I.25844.

Full text
Abstract:
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising fabric for improving the performance and power-efficiency of computing devices. CGRAs are composed of components that are well optimized to execute loops, and the rotating register file is an example of such a component present in CGRAs. Due to the rotating nature of register indexes in a rotating register file, it is very challenging, if at all possible, to hold and properly index memory addresses (pointers) and static values. In this thesis, different structures for CGRA register files are investigated. These structures are experimentally compared in terms of the performance of mapped applications, design frequency, and area. It is shown that a register file that can be logically partitioned into rotating and non-rotating regions is an excellent choice because it imposes the minimum restriction on the underlying CGRA mapping algorithm while resulting in efficient resource utilization.
Dissertation/Thesis
Masters Thesis Computer Science 2014
APA, Harvard, Vancouver, ISO, and other styles
41

Obeid, Abdulfattah Mohammad. "Architectural Synthesis of a Coarse-Grained Run-Time-Reconfigurable Accelerator for DSP Applications." PhD thesis, 2006. https://tuprints.ulb.tu-darmstadt.de/668/1/ObeidDissG_Part1v2.pdf.

Full text
Abstract:
Given all its merits and potential, Reconfigurable Computing has attracted a great deal of research. Reconfiguration costs, as well as new challenges specific to Reconfigurable Computing, have so far been the main obstacles to reaching optimal reconfigurable computing solutions. Because of the flexibility offered by Reconfigurable Computing, many new design parameters that were previously unknown now exist. Dynamic reconfiguration, partial reconfiguration, context management and HW/SW issues are among these. Depending on the target set of applications, different design decisions can be made in order to optimize the reconfigurable solution according to the target application constraints. In this thesis the HPad, an efficient coarse-grained dynamically reconfigurable solution targeted at DSP computation, is proposed. The HPad architecture was greatly influenced by reported VLSI architectures of a variety of DSP algorithms. Based on observations of the characteristics of these DSP algorithms and their architectures, the HPad was chosen to be a heterogeneous and dynamically reconfigurable coarse-grained solution. The HPad features partial, dynamic, and background reconfiguration capabilities. In addition, the HPad data path architecture is tailored to efficiently realize the studied DSP applications. Through the use of local reconfiguration interface sockets around each processing element, the dynamic reconfiguration problem is partitioned and efficiently solved. The HPad was modeled and synthesized with parameterizable VHDL code written at the RTL level. Parameterizing the code was beneficial since it permitted the generation of new designs simply by changing a few constants and recompiling. The model consisted of several thousand lines of code. Mapping and routing of several pipelined architectures of DSP algorithms were examined to demonstrate the suitability and validity of the HPad for the proposed scope of
APA, Harvard, Vancouver, ISO, and other styles
42

Liu, Xiaobin. "ENERGY EFFICIENCY EXPLORATION OF COARSE-GRAIN RECONFIGURABLE ARCHITECTURE WITH EMERGING NONVOLATILE MEMORY." 2015. https://scholarworks.umass.edu/masters_theses_2/159.

Full text
Abstract:
With the rapid growth in consumer electronics, people expect thin, smart and powerful devices, e.g. Google Glass and other wearable devices. However, as portable electronic products become smaller, energy consumption becomes an issue that limits the development of portable systems due to battery lifetime. In general, simply reducing device size cannot fully address the energy issue. To tackle this problem, we propose an on-chip interconnect infrastructure and program storage structure for a coarse-grained reconfigurable architecture (CGRA) with emerging non-volatile embedded memory (MRAM). The interconnect is composed of a matrix of time-multiplexed switchboxes which can be dynamically reconfigured with the goal of energy reduction. The number of processors performing computation can also be adapted. The use of MRAM provides access to high-density storage and lower memory energy consumption versus more standard SRAM technologies. The combination of CGRA, MRAM, and flexible on-chip interconnection is considered for signal processing. This application domain is of interest because of its time-varying computing demands. To evaluate CGRA architectural features, prototype architectures have been implemented in a field-programmable gate array (FPGA). Measurements of energy, power, instruction count, and execution-time performance are considered for a scalable number of processors. Applications such as adaptive Viterbi decoding and Reed-Solomon coding are used for evaluation. To complete this thesis, a time-scheduled switchbox was integrated into our CGRA model. This model was prototyped on an FPGA. It is shown that energy consumption can be reduced by about 30% if dynamic design reconfiguration is performed.
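The time-multiplexed switchbox can be pictured as a routing table that changes with the schedule slot. The Python sketch below is a behavioural model with invented port names and a made-up two-slot schedule; it is not the FPGA prototype evaluated in the thesis.

```python
# Behavioural model of a time-multiplexed switchbox: the active routing
# table (output <- input) changes with the schedule slot each cycle.
SCHEDULE = [
    {"east": "west", "south": "north"},   # slot 0: straight-through routing
    {"east": "north", "south": "west"},   # slot 1: turn routing
]

def switchbox(cycle, inputs):
    """Route the input ports to output ports for the current schedule slot."""
    table = SCHEDULE[cycle % len(SCHEDULE)]
    return {out_port: inputs[in_port] for out_port, in_port in table.items()}

ins = {"north": 1, "west": 2}
print(switchbox(0, ins))   # {'east': 2, 'south': 1}
print(switchbox(1, ins))   # {'east': 1, 'south': 2}
```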
APA, Harvard, Vancouver, ISO, and other styles
43

Obeid, Abdulfattah Mohammad [Verfasser]. "Architectural synthesis of a coarse-grained run-time-reconfigurable accelerator for DSP applications / Abdulfattah Mohammad Obeid." 2006. http://d-nb.info/979006651/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Huang, Yin-Hao, and 黃胤豪. "Morphological Behaviors of Rod-like Polymers with Different Architectures of Side-chain: a GPU-accelerated Coarse-grained Molecular Dynamics Study." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/07022722783803605959.

Full text
Abstract:
Master's thesis
National Taiwan University
Institute of Polymer Science and Engineering
100
GPU-accelerated coarse-grained molecular dynamics is adopted to investigate the effects of various side-chain architectures, namely the composition of the main chain, the grafting density of side chains with different side-chain bead sizes, and the rigidity of the main chain, on the morphological behavior of rod-like polymers. We set our parameters according to the well-packed structure of poly(3-hexylthiophene)s in the melt state. Self-assembly into single-layered lamellae, smectic C, and nematic phases was observed for different side-chain architectures at 100% grafting density. The order parameters decreased rapidly when the main-chain composition fell below 0.2 for all side-chain architectures, owing to the interfacial energy between the two kinds of coarse-grained beads, i.e. the main-chain and side-chain beads. When the fraction of main chain was fixed at 0.5, hexagonal cylinders and double-layered lamellae were found at 50% and 25% grafting density of linear side chains, respectively. The main chains aggregate more closely at 50% grafting density than at 25% grafting density across the different side-chain architectures. Moreover, multi-stranded helices were observed on increasing the bead size for different side-chain architectures in the 50% grafting density system. Finally, the spring constant of the main-chain dihedral angles was reduced to study the effect of breaking the coplanarity of the main chains on the morphological behavior of rod-like polymers, and a destruction of the ordered state was found in this case.
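The order parameter referred to above is, in coarse-grained simulations of rod-like polymers, usually the standard nematic order parameter: the largest eigenvalue of the Q-tensor built from the backbone bond unit vectors. The NumPy sketch below shows that textbook calculation only; it is not the GPU-accelerated simulation code used in the thesis, and the vectors are synthetic.

```python
import numpy as np

def nematic_order_parameter(bond_vectors):
    """S = largest eigenvalue of Q = <(3/2) u u^T - (1/2) I> over unit vectors u."""
    u = bond_vectors / np.linalg.norm(bond_vectors, axis=1, keepdims=True)
    Q = 1.5 * np.einsum("ni,nj->ij", u, u) / len(u) - 0.5 * np.eye(3)
    return np.linalg.eigvalsh(Q).max()

# Perfectly aligned rods give S ~ 1; isotropic orientations give S ~ 0.
aligned = np.tile([0.0, 0.0, 1.0], (500, 1))
random = np.random.default_rng(2).standard_normal((5000, 3))
print(round(nematic_order_parameter(aligned), 3))   # ~1.0
print(round(nematic_order_parameter(random), 3))    # close to 0
```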
APA, Harvard, Vancouver, ISO, and other styles
45

Satrawala, Amar Nath. "RETHROTTLE : Execution Throttling In The REDEFINE SoC Architecture." Thesis, 2009. http://hdl.handle.net/2005/1017.

Full text
Abstract:
REDEFINE is a reconfigurable SoC architecture that provides a unique platform for high-performance and low-power computing by exploiting the synergistic interaction between a coarse-grained dynamic dataflow model of computation (to expose abundant parallelism in the applications) and runtime composition of efficient compute structures (on the reconfigurable computation resources). Computer architectures based on the dynamic dataflow model of computation would have to be infinite-resource implementations to be able to exploit all available parallelism in all applications, which is not feasible in any real architectural implementation. When limited-resource implementations are considered, there is a possibility of loss of performance (the inability to efficiently exploit available parallelism). In this thesis, we study the throttling of execution in the REDEFINE architecture to maximize the architecture's efficiency. We have formulated it as a design space exploration problem at two levels, i.e. architectural configurations and throttling schemes. Reduced-feature/high-level simulation or feature-specific analytical approaches are very useful for the selective study/exploration of architectures/systems early in the design phase. Our approach is similar to that of the SEASAME framework, which is used for the study of MPSoC (Multiprocessor SoC) architectures. We have used abstraction (feature reduction) at the levels of the architecture and the model of computation to make the problem approachable and practically feasible. A feature-specific, fast, hybrid (mixed-level) simulation framework for early-design-phase study was developed and implemented for the huge design space exploration (1284 throttling schemes, 128 architectural configurations and 10 applications, i.e. 1.6 million executions). We have performed performance modeling in terms of the selection of important performance criteria, the ranking of the explored throttling schemes, and an investigation of the effectiveness of the design space exploration using statistical hypothesis testing. We found some interesting obvious/intuitive and some non-obvious/counterintuitive results. The two performance criteria, namely Exec.T and Avg.TU, were found sufficient to represent the performance and resource-usage characteristics of the architecture independent of the throttling schemes, the architectural configurations and the applications. The ranking of the throttling schemes based on the selected performance criteria is found to be statistically very significant. The intuitive throttling schemes span the range of performance from the best to the worst. We found an absence of trade-offs among the performance criteria. The best throttling schemes give appreciable overall performance (25%) and resource usage (37%) gains in the throttling unit simultaneously. The design space exploration of the throttling schemes is found to be fine and uniform.
APA, Harvard, Vancouver, ISO, and other styles
46

Georgiopoulos, Stavros. "Μεθοδολογίες μεταγλώττισης σε επαναπροσδιοριζόμενα συστήματα αρχιτεκτονικών πίνακα" [Compilation methodologies for reconfigurable array architecture systems]. Thesis, 2011. http://hdl.handle.net/10889/5806.

Full text
Abstract:
This doctoral thesis focuses on the development of efficient compilation techniques for coarse-grained reconfigurable array architectures. Data-intensive applications were used to evaluate the proposed methodologies. The aim is to optimize application execution with respect to characteristics of reconfigurable systems such as performance, instructions per cycle, integration area and utilization of the processing resources. This is achieved by introducing novel mapping techniques and by finding optimal architectures. In the first part of the thesis, research, development and automation of compilation techniques targeting coarse-grained reconfigurable arrays was carried out. The main feature of these architectures is the presence of a large number of processing elements working in parallel, thus speeding up the execution of applications that exhibit operation-level parallelism. In embedded systems these architectures operate as coprocessors. Research on reconfigurable array architectures has gained considerable interest because of their flexibility, scalability and performance, particularly for data-intensive applications. Nevertheless, compiling applications onto these architectures is characterized by a high degree of complexity; appropriate tools and special mapping methodologies are needed to exploit their characteristics. With this in mind, we proposed a novel retargetable methodology for mapping applications, which has also been automated with a prototype compiler tool targeting a parametric architectural template. The result was the identification of the best architectures, for a sample set of applications, on the basis of performance, instructions per cycle and the tool's execution time. The efficiency of a reconfigurable array architecture in terms of speed and hardware cost is difficult to evaluate, so few studies have examined the effect of architectural parameters on factors such as integration area and instructions per clock cycle. Moreover, no prior work has examined the impact of multipliers embedded in the processing elements of reconfigurable architectures. Using the existing retargetable mapping methodology and a parametric implementation of the architecture in a hardware description language, we examine the effect of multipliers from both the mapping and the architecture perspective. We also describe an original mapping methodology introduced for the purpose of efficiently mapping the Fast Fourier Transform (FFT) algorithm onto reconfigurable array architectures. The FFT algorithm is characterized by a large number of operations, primarily multiplications, that slow down the performance of a reconfigurable architecture. Exploiting the internal repetitive structure of the FFT algorithm and using a reconfigurable architecture template of 16 processing elements, we developed a novel mapping technique. Additionally, our technique takes into account the memory hierarchy between main memory and the reconfigurable architecture in order to further accelerate the execution of the FFT algorithm. Using the proposed mapping technique results in processing element utilization of over 90%, which is at least 37% higher than the best value reported in the related literature.
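For orientation, and assuming a radix-2 formulation, an N-point FFT consists of log2(N) stages of N/2 independent butterflies each, so a 16-element array processes each stage in ceil(N/32) batches when one butterfly is assigned per processing element. The Python sketch below merely counts that partitioning under these assumptions; it does not reproduce the mapping technique or the memory-hierarchy handling developed in the thesis.

```python
import math

def fft_stage_batches(n_points, n_pes=16):
    """Butterfly batches per stage when each PE handles one butterfly at a time."""
    butterflies_per_stage = n_points // 2
    stages = int(math.log2(n_points))
    batches = math.ceil(butterflies_per_stage / n_pes)
    return stages, butterflies_per_stage, batches

for n in (64, 256, 1024):
    stages, bf, batches = fft_stage_batches(n)
    print(f"N={n:5d}: {stages} stages x {bf} butterflies "
          f"-> {batches} batches of 16 PEs per stage")
```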
APA, Harvard, Vancouver, ISO, and other styles