Academic literature on the topic 'Loop tiling'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Loop tiling.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
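
As a primer for the sources below: loop tiling (also called blocking) restructures a loop nest so that it works on small blocks of the iteration space, keeping each block's data in cache. The following minimal C sketch illustrates the idea on matrix multiplication; it is our own illustration, not code from any listed work, and the function name, data layout, and runtime tile-size parameter B are assumptions.

```c
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Tiled (blocked) matrix multiplication: c += a * b for n x n
 * row-major matrices, with c assumed zero-initialized. The three
 * outer loops walk B x B tiles; the inner loops stay inside one
 * tile so its working set remains cache-resident. */
void matmul_tiled(size_t n, const double *a, const double *b,
                  double *c, size_t B) {
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t kk = 0; kk < n; kk += B)
            for (size_t jj = 0; jj < n; jj += B)
                for (size_t i = ii; i < MIN(ii + B, n); i++)
                    for (size_t k = kk; k < MIN(kk + B, n); k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < MIN(jj + B, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Passing B at run time rather than fixing it at compile time is exactly the setting studied under "parameterized loop tiling" in the first journal entry below.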

Journal articles on the topic "Loop tiling"

1. Renganarayanan, Lakshminarayanan, Daegon Kim, Michelle Mills Strout, and Sanjay Rajopadhye. "Parameterized loop tiling." ACM Transactions on Programming Languages and Systems 34, no. 1 (2012): 1–41. http://dx.doi.org/10.1145/2160910.2160912.

2. Darte, Alain, Georges-André Silber, and Frédéric Vivien. "Combining Retiming and Scheduling Techniques for Loop Parallelization and Loop Tiling." Parallel Processing Letters 7, no. 4 (1997): 379–92. http://dx.doi.org/10.1142/s0129626497000383.

Abstract:
Tiling is a technique used for exploiting medium-grain parallelism in nested loops. It relies on a first step that detects sets of permutable nested loops. All algorithms developed so far consider the statements of the loop body as a single block; in other words, they are not able to take advantage of the structure of dependences between different statements. In this paper, we overcome this limitation by showing how the structure of the reduced dependence graph can be taken into account for detecting more permutable loops. Our method combines graph retiming techniques and graph scheduling techniques. It can be viewed as an extension of Wolf and Lam's algorithm to the case of loops with multiple statements. Loop-independent dependences play a particular role in our study, and we show how the way we handle them can be useful for fine-grain loop parallelization as well.

3. Xue, Jingling. "On Tiling as a Loop Transformation." Parallel Processing Letters 7, no. 4 (1997): 409–24. http://dx.doi.org/10.1142/s0129626497000401.

Abstract:
This paper is a follow-up to Irigoin and Triolet's earlier work and our recent work on tiling. In this paper, tiling is discussed in terms of its effects on the dependences between tiles, the dependences within a tile, and the required dependence test for legality. A necessary and sufficient condition is given for enforcing the data dependences of the program, while Irigoin and Triolet's atomic tile constraint is only sufficient. A condition is identified under which both Irigoin and Triolet's and our constraints are equivalent. The results of this paper are discussed in terms of their impact on dependence abstractions suitable for the legality test and on tiling to optimise a given goal.

4. Bielecki, Włodzimierz, and Marek Pałkowski. "Tiling arbitrarily nested loops by means of the transitive closure of dependence graphs." International Journal of Applied Mathematics and Computer Science 26, no. 4 (2016): 919–39. http://dx.doi.org/10.1515/amcs-2016-0065.

Abstract:
A novel approach to the generation of tiled code for arbitrarily nested loops is presented. It is derived via a combination of the polyhedral and iteration space slicing frameworks. Instead of program transformations represented by a set of affine functions, one for each statement, it uses the transitive closure of a loop nest dependence graph to carry out corrections of original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of target tiles. Parallel tiled code can be generated on the basis of valid serial tiled code by means of applying affine transformations or transitive closure, using on input an inter-tile dependence graph whose vertices are represented by target tiles while edges connect dependent target tiles. We demonstrate how a relation describing such a graph can be formed. The main merit of the presented approach in comparison with the well-known ones is that it does not require full permutability of loops to generate both serial and parallel tiled codes; this increases the scope of loop nests that can be tiled.

5. Parsa, Saeed, and Shahriar Lotfi. "A New Genetic Algorithm for Loop Tiling." Journal of Supercomputing 37, no. 3 (2006): 249–69. http://dx.doi.org/10.1007/s11227-006-6367-9.

6. Shao, Zhongxi, Shilei Wu, Jinguo Wu, and Hongya Fu. "A novel 5-DOF high-precision compliant parallel mechanism for large-aperture grating tiling." Mechanical Sciences 8, no. 2 (2017): 349–58. http://dx.doi.org/10.5194/ms-8-349-2017.

Abstract:
In combination with the advantages of parallel mechanisms and compliant mechanisms, a 5-DOF compliant parallel mechanism is designed to meet the requirements, such as large stroke, large load capacity, high precision and high stability, for a large-aperture grating tiling device. The structure and characteristics of the 5-DOF compliant parallel mechanism are presented. The kinematics of the mechanism are derived based on a pseudo-rigid-body model as well. To increase the tiling position retention stability of the mechanism, a closed-loop control system with capacitive position sensors, which are employed to provide feedback signals, is realized. A position and orientation monitoring algorithm and a single neuron adaptive full closed-loop control algorithm are proposed. Performance testing is implemented to verify the accuracy and the tiling position retention stability of the grating tiling device. The experimental results indicate that the tiling accuracy reaches 0.2 µrad per step and 20 nm per step, and the tiling position retention stability can achieve 1.2 µrad per 30 min and 35 nm per 30 min in the rotational direction and the translational direction, respectively.

7. Leopold, Claudia. "Cache Miss Analysis of 2D Stencil Codes with Tiled Time Loop." International Journal of Foundations of Computer Science 14, no. 1 (2003): 39–58. http://dx.doi.org/10.1142/s0129054103001583.

Abstract:
Stencil codes such as the Jacobi, Gauß-Seidel, and red-black Gauß-Seidel kernels are among the most time-consuming routines in many scientific and engineering applications. The performance of these codes critically depends on an efficient usage of caches, and can be improved by tiling. Several tiling schemes have been suggested in the literature; this paper gives an overview and comparison. Then, in the main part, we prove a lower bound on the number of cold and capacity misses. Finally, we analyze a particular tiling scheme, and show that it is off the lower bound by a factor of at most ten. Our results reveal limitations to the speedup that can be gained by future research.

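To make this entry's subject concrete, the sketch below tiles the time loop of a 1D Jacobi kernel by skewing; the paper's 2D analysis follows the same principle. This is our own illustration, not code from the paper; the array layout, problem sizes, and tile parameters BT and BJ are assumptions.

```c
#include <stdio.h>

#define N  1024   /* spatial points                  */
#define T  256    /* time steps                      */
#define BT 16     /* tile extent along time t        */
#define BJ 64     /* tile extent along skewed axis j */

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static double A[T + 1][N];   /* A[t][i]: value at time t, point i */

int main(void) {
    /* initial condition plus fixed boundary values for every time plane */
    for (int i = 0; i < N; i++) A[0][i] = (double)i;
    for (int t = 1; t <= T; t++) { A[t][0] = A[0][0]; A[t][N - 1] = A[0][N - 1]; }

    /* Skewing j = i + t turns the dependences (t-1, i-1..i+1) -> (t, i)
     * into (t-1, j-2..j) -> (t, j), so rectangular (t, j) tiles executed
     * in lexicographic order preserve all dependences while the time
     * loop itself is tiled. */
    for (int tb = 1; tb <= T; tb += BT)
        for (int jb = 2; jb <= T + N - 2; jb += BJ)
            for (int t = tb; t <= MIN(tb + BT - 1, T); t++)
                for (int j = MAX(jb, t + 1); j <= MIN(jb + BJ - 1, t + N - 2); j++) {
                    int i = j - t;   /* un-skew to the original space index */
                    A[t][i] = (A[t - 1][i - 1] + A[t - 1][i] + A[t - 1][i + 1]) / 3.0;
                }

    printf("A[T][N/2] = %f\n", A[T][N / 2]);
    return 0;
}
```
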
8. Liu, Xiaoxian, Rongcai Zhao, Rui Ding, and Yanbing Li. "Pipelining granularity optimization algorithm based on loop tiling." Journal of Computer Applications 33, no. 8 (2013): 2171–76. http://dx.doi.org/10.3724/sp.j.1087.2013.02171.

9. Di, Peng, Hui Wu, Jingling Xue, Feng Wang, and Canqun Yang. "Parallelizing SOR for GPGPUs using alternate loop tiling." Parallel Computing 38, no. 6–7 (2012): 310–28. http://dx.doi.org/10.1016/j.parco.2012.03.004.

10. Bielecki, Wlodzimierz, and Marek Palkowski. "Space-Time Loop Tiling for Dynamic Programming Codes." Electronics 10, no. 18 (2021): 2233. http://dx.doi.org/10.3390/electronics10182233.

Abstract:
We present a new space-time loop tiling approach and demonstrate its application to the generation of parallel tiled code of enhanced locality for three dynamic programming algorithms. The technique envisages that, for each loop nest statement, sub-spaces are first generated so that their intersection results in space tiles. Space tiles can be enumerated in lexicographical order or in parallel by using the wave-front technique. Then, within each space tile, time slices are formed, which are enumerated in lexicographical order. Target tiles are represented with multiple time slices within each space tile. We explain the basic idea of space-time loop tiling and then illustrate it by means of an example. Then, we present a formal algorithm and prove its correctness. The algorithm is implemented in the publicly available TRACO compiler. Experimental results demonstrate that parallel codes generated by means of the presented approach outperform closely related manually generated ones or those generated by using affine transformations. The main advantage of code generated by means of the presented approach is its enhanced locality due to splitting each larger space tile into multiple smaller tiles represented with time slices.

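The wave-front enumeration mentioned in the abstract above runs tiles along anti-diagonals: when dependences point toward increasing tile coordinates, all tiles on one diagonal are mutually independent. A minimal OpenMP sketch in C, our own illustration rather than code from the paper (process_tile is a hypothetical stand-in for the intra-tile loops):

```c
#include <stdio.h>

/* Hypothetical stand-in for one tile's work; a real tiled code would
 * run the intra-tile loops here. */
static void process_tile(int ti, int tj) {
    printf("tile (%d, %d)\n", ti, tj);
}

/* Enumerate an nti x ntj grid of tiles by wavefronts: each tile on the
 * anti-diagonal ti + tj == w depends only on tiles of earlier
 * wavefronts, so one wavefront's tiles may execute in parallel. */
static void run_tiles_wavefront(int nti, int ntj) {
    for (int w = 0; w <= nti + ntj - 2; w++) {
        #pragma omp parallel for schedule(dynamic)
        for (int ti = 0; ti < nti; ti++) {
            int tj = w - ti;
            if (tj >= 0 && tj < ntj)
                process_tile(ti, tj);
        }
    }
}

int main(void) {
    run_tiles_wavefront(4, 3);
    return 0;
}
```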

Dissertations / Theses on the topic "Loop tiling"

1. Bernard, Selvaraj Anand Joseph. "Effects of Loop Tiling using Primetile and Dyntile." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1285012179.

2. Grosser, Tobias. "A decoupled approach to high-level loop optimization: tile shapes, polyhedral building blocks and low-level compilers." Thesis, Paris 6, 2014. http://www.theses.fr/2014PA066270/document.

Abstract:
Despite decades of research on high-level loop optimizations and their successful integration in production C/C++/FORTRAN compilers, most compiler-internal loop transformation systems only partially address the challenges posed by the increased complexity and diversity of today's hardware. Especially when exploiting domain-specific knowledge to obtain optimal code for complex targets such as accelerators or many-core processors, many existing loop optimization frameworks have difficulties exploiting this hardware. As a result, new domain-specific optimization schemes are developed independently without taking advantage of existing loop optimization technology, which results both in missed optimization opportunities and in low portability of these optimization schemes to different compilers. One area where we see the need for better optimizations is iterative stencil computations, an important computational problem that is regularly optimized by specialized, domain-specific compilers, but for which generating efficient code is difficult.

In this work we present new domain-specific optimization strategies that enable the generation of high-performance GPU code for stencil computations. Unlike most existing domain-specific compilers, we decouple the high-level optimization strategy from the low-level optimization and specialization necessary to yield optimal performance. As the high-level optimization scheme we present a new formulation of split tiling, a tiling technique that ensures reuse along the time dimension as well as balanced coarse-grained parallelism without the need for redundant computations. Using split tiling, we show how to integrate a domain-specific optimization into a general-purpose C-to-CUDA translator, an approach that allows us to reuse existing non-domain-specific optimizations. We then evolve split tiling into a hybrid hexagonal/parallelogram tiling scheme that allows us to generate code that addresses GPU-specific concerns even better. To conclude our work on tiling schemes, we investigate the relation between diamond and hexagonal tiling. Starting with a detailed analysis of diamond tiling, including the requirements it poses on tile sizes and wavefront coefficients, we provide a unified formulation of hexagonal and diamond tiling which enables us to perform hexagonal tiling for two-dimensional problems (one time, one space) in the context of a general-purpose optimizer such as Pluto. Finally, we use this formulation to evaluate hexagonal and diamond tiling in terms of compute-to-communication and compute-to-synchronization ratios.

In the second part of this work, we discuss our contributions to important infrastructure components, our building blocks, that enable us to decouple our high-level optimizations from both the necessary code generation optimizations and the compiler infrastructure the optimizations are applied to. We start by presenting a new polyhedral extractor that obtains a polyhedral representation from a piece of C code, widening the supported C code to exploit the full generality of Presburger arithmetic and taking special care to model language semantics even in the presence of defined integer wrapping. As a next step, we present a new polyhedral AST generation approach which extends AST generation beyond classical control flow generation by allowing the generation of user-provided mappings. Through a fine-grained option mechanism, we give the user fine-grained control over AST generator decisions and add extensive support for specialization, e.g., with a new generalized form of polyhedral unrolling. To facilitate the implementation of polyhedral transformations, we present a new schedule representation, schedule trees, which makes the inherent tree structure of schedules explicit in order to simplify the work with complex polyhedral schedules. The last part of this work presents our contributions to low-level compilers.

3. Hartono, Albert. "Tools for Performance Optimizations and Tuning of Affine Loop Nests." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1259685041.

4. Gao, Xiaoyang. "Integrated compiler optimizations for tensor contractions." Columbus, Ohio: Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1198874631.

5. Amstel, Duco van. "Optimisation de la localité des données sur architectures manycœurs" [Data-locality optimization on manycore architectures]. Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM019/document.

Abstract:
The continuous evolution of computer architectures has been an important driver of research in code optimization and compiler technologies. A trend in this evolution that can be traced back over decades is the growing ratio between the available computational power (IPS, FLOPS, ...) and the corresponding bandwidth between the various levels of the memory hierarchy (registers, cache, DRAM). As a result, the reduction of the number of memory communications that a given code requires has been an important topic in compiler research. A basic principle for such optimizations is the improvement of temporal data locality: grouping all references to a single data point as close together as possible, so that the data is only required for a short duration and can then be moved to distant memory (DRAM) without any further memory communications.

Yet another architectural evolution has been the advent of the multicore era and, in the most recent years, the first generations of manycore designs. These architectures have considerably raised the bar on the amount of parallelism available to programs and algorithms, but this is again limited by the bandwidth available for communications between cores. This brings issues that were previously the sole preoccupation of distributed computing to the world of compilation and code optimization techniques.

In this document we present a first dive into a new optimization technique, generalized tiling, which offers both a high-level model for data reuse and a large field of potential applications. It finds its source in the well-known loop tiling technique, which has been applied with success to improve data locality for both registers and cache memory in the case of nested loops. This new flavor of tiling takes a much broader perspective and is not limited to nested loops. It is built on a new representation, the memory-use graph, which is tightly linked to a new model of memory usage and communication requirements and which applies to all forms of iterated code. Generalized tiling expresses data locality as an optimization problem for which multiple solutions are proposed. The abstraction introduced by the memory-use graph makes it possible to solve this optimization problem in different contexts. For the experimental evaluation we show how this new technique can be applied to loops, nested or not, as well as to programs expressed in a dataflow language. Anticipating the use of generalized tiling to distribute computations over the cores of a manycore architecture, we also give some insight into how communications and their characteristics can be modeled on such architectures.

As a final point, and to show the full expressiveness of the memory-use graph and the underlying memory-usage and communication model, we turn to the topic of performance debugging and the analysis of execution traces. Our goal is to provide feedback on the evaluated code and its potential for further improvement of data locality. Such traces may contain information about memory communications during an execution and show strong similarities with the previously studied optimization problem. This brings us to a short introduction to the algorithmics of directed graphs and to some new heuristics for the well-studied problem of reachability as well as the much less studied problem of convex partitioning.

6. Rajaraman, Bhargavi. "Implementation and Evaluation of Register Tiling for Perfectly Nested Loops." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1245123518.

7. Thapper, Johan. "Combinatorial Considerations on Two Models from Statistical Mechanics." Licentiate thesis, Linköping: Matematiska institutionen, Linköpings universitet, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10141.

8. Pasca, Bogdan Mihai. "Calcul flottant haute performance sur circuits reconfigurables" [High-performance floating-point computing on reconfigurable circuits]. PhD thesis, École normale supérieure de Lyon (ENS Lyon), 2011. http://tel.archives-ouvertes.fr/tel-00654121.

Abstract:
More and more vendors offer compute accelerators based on reconfigurable FPGA circuits, a technology that offers far more flexibility than the microprocessor. Exploiting this flexibility for accelerating floating-point computations with the classical hardware description languages (VHDL or Verilog) remains very difficult, however, and at times impossible. This thesis contributed to the development of the FloPoCo software, which offers users familiar with VHDL a C++ framework for describing generic arithmetic operators suited to reconfigurable computing. This framework explicitly separates the combinational functionality of an operator from the problem of pipelining it for a given precision, frequency, and target FPGA. To use FloPoCo for designing high-performance floating-point operators, optimized basic blocks first had to be designed. We first developed pipelined adders built around fast carry-propagation chains and then, using tiling techniques, constructed large multipliers, possibly truncated, out of small multipliers. Evaluating elementary functions in floating point often involves the fixed-point evaluation of a function. We present a generic FloPoCo operator that takes as input the expression of the function to be evaluated, together with its input and output precisions, and builds an optimized polynomial evaluator for this function. This basic block enabled the development of floating-point operators for the square root and the exponential that considerably improve on the state of the art. We also worked on advanced compilation techniques to adapt the execution of C code to the flexible pipelines of our operators. FloPoCo has thus been used to implement complete applications on FPGAs.

9. Bhaskaracharya, Somashekaracharya G. "Automatic Storage Optimization of Arrays in Affine Loop Nests." Thesis, 2016. http://hdl.handle.net/2005/3208.

Abstract:
Efficient memory usage is crucial for data-intensive applications, as a smaller memory footprint ensures better cache performance and allows one to run a larger problem size given a fixed amount of main memory. The solutions found by existing techniques for automatic storage optimization for arrays in affine loop nests, which minimize the storage requirements for the arrays, are often far from good or optimal and can even miss nearly all storage optimization potential. In this work, we present a new automatic storage optimization framework and techniques that can be used to achieve intra-array as well as inter-array storage reuse within affine loop nests with a pre-determined schedule. Over the last two decades, several heuristics have been developed for achieving complex transformations of affine loop nests using the polyhedral model. However, there are no comparably strong heuristics for tackling the problem of automatic memory footprint optimization. We tackle the problem of storage optimization for arrays by formulating it as one of finding the right storage partitioning hyperplanes: each storage partition corresponds to a single storage location. Statement-wise storage partitioning hyperplanes are determined that partition a unified global array space so that values with overlapping live ranges are not mapped to the same partition. Our integrated heuristic for exploiting intra-array as well as inter-array reuse opportunities is driven by a fourfold objective function that not only minimizes the dimensionality and storage requirements of arrays required for each high-level statement, but also maximizes inter-statement storage reuse. We built an automatic polyhedral storage optimizer called SMO using our storage partitioning approach. Storage reduction factors and other results we report from SMO demonstrate the effectiveness of our approach on several benchmarks drawn from the domains of image processing, stencil computations, high-performance computing, and the class of tiled codes in general. The reductions in storage requirement over previous approaches range from a constant factor to asymptotic in the loop blocking factor or array extents, the latter being a dramatic improvement for practical purposes.

As an incidental and related topic, we also studied the problem of polyhedral compilation of graphical dataflow programs. While polyhedral techniques for program transformation are now used in several proprietary and open-source compilers, most of the research on polyhedral compilation has focused on imperative languages such as C, where the computation is specified in terms of statements with zero or more nested loops and other control structures around them. Graphical dataflow languages, where there is no notion of statements or of a schedule specifying their relative execution order, have so far not been studied using a powerful transformation or optimization approach. The execution semantics and referential transparency of dataflow languages impose a different set of challenges. In this work, we attempt to bridge this gap by presenting techniques that can be used to extract a polyhedral representation from dataflow programs and to synthesize them from their equivalent polyhedral representation. We then describe PolyGLoT, a framework for the automatic transformation of dataflow programs that we built using our techniques and other popular research tools such as Clan and Pluto. For the purpose of experimental evaluation, we used our tools to compile LabVIEW, one of the most widely used dataflow programming languages. Results show that dataflow programs transformed using our framework are able to outperform those compiled otherwise by up to a factor of seventeen, with a mean speed-up of 2.30, while running on an 8-core Intel system.

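As a miniature of the intra-array storage reuse this thesis automates, the sketch below contracts the time dimension of a Jacobi buffer by hand: iteration t reads only plane t - 1, so the mapping t -> t mod 2 is a legal storage partitioning. This is our own illustration under those assumptions, not output of the SMO optimizer described above.

```c
#include <stdio.h>

#define N 1024
#define T 100

/* Unoptimized storage would keep every time plane: A[T + 1][N].
 * Since the live ranges of planes t - 1 and t + 1 never overlap,
 * the storage mapping t -> t % 2 contracts the array to A[2][N]. */
static double A[2][N];

int main(void) {
    for (int i = 0; i < N; i++) A[0][i] = (double)i;

    for (int t = 1; t <= T; t++) {
        const double *prev = A[(t - 1) % 2];
        double       *cur  = A[t % 2];
        for (int i = 1; i < N - 1; i++)
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0;
        cur[0] = prev[0];            /* carry boundary values forward */
        cur[N - 1] = prev[N - 1];
    }

    printf("A[T %% 2][N / 2] = %f\n", A[T % 2][N / 2]);
    return 0;
}
```
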
10. Pananilath, Irshad Muhammed. "An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations." Thesis, 2014. http://hdl.handle.net/2005/3259.

Abstract:
The Lattice-Boltzmann method (LBM), a promising new particle-based simulation technique for complex and multiscale fluid flows, has seen tremendous adoption in recent years in computational fluid dynamics. Even with a state-of-the-art LBM solver such as Palabos, a user still has to manually write his program using the library-supplied primitives. We propose an automated code generator for a class of LBM computations with the objective of achieving high performance on modern architectures. Tiling is a very important loop transformation used to improve the performance of stencil computations by exploiting locality and parallelism. In the first part of the work, we explore diamond tiling, a new tiling technique that exploits the inherent ability of most stencils to allow tile-wise concurrent start. This enables perfect load balance during execution and reduces the frequency of synchronization required. Few studies have looked at time tiling for LBM codes. We exploit a key similarity between stencils and LBM to enable polyhedral optimizations and, in turn, time tiling for LBM. Besides polyhedral transformations, we also describe a number of other complementary transformations and post-processing steps necessary to obtain good parallel and SIMD performance on modern architectures. We also characterize the performance of LBM with the Roofline performance model. Experimental results for standard LBM simulations like Lid-Driven Cavity, Flow Past Cylinder, and Poiseuille Flow show that our scheme consistently outperforms Palabos, on average by 3x, while running on 16 cores of an Intel Xeon SandyBridge system. We also obtain a very significant improvement of 2.47x over the native production compiler on the SPEC LBM benchmark.

Books on the topic "Loop tiling"

1. Xue, Jingling. Loop Tiling for Parallelism. Kluwer Academic, 2000.

2. Xue, Jingling. Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4.

Book chapters on the topic "Loop tiling"

1. Dongarra, Jack, Piotr Luszczek, Paul Feautrier, et al. "Loop Tiling." In Encyclopedia of Parallel Computing. Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-09766-4_2471.

2. Xue, Jingling. "Rectangular Tiling." In Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4_3.

3. Xue, Jingling. "Parallelepiped Tiling." In Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4_4.

4. Xue, Jingling. "Communication-Minimal Tiling." In Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4_6.

5. Xue, Jingling. "Time-Minimal Tiling." In Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4_7.

6. Xue, Jingling. "Mathematical Background." In Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4_1.

7. Xue, Jingling. "Nonsingular Transformations and Permutability." In Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4_2.

8. Xue, Jingling. "SPMD Code Generation." In Loop Tiling for Parallelism. Springer US, 2000. http://dx.doi.org/10.1007/978-1-4615-4337-4_5.

9. Derrien, Steven, and Sanjay Rajopadhye. "Loop Tiling for Reconfigurable Accelerators." In Field-Programmable Logic and Applications. Springer Berlin Heidelberg, 2001. http://dx.doi.org/10.1007/3-540-44687-7_41.

10. Kim, DaeGon, and Sanjay Rajopadhye. "Efficient Tiled Loop Generation: D-Tiling." In Languages and Compilers for Parallel Computing. Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-13374-9_20.

Conference papers on the topic "Loop tiling"

1. Zhao, Jiacheng, Huimin Cui, Yalin Zhang, Jingling Xue, and Xiaobing Feng. "Revisiting Loop Tiling for Datacenters." In ICS '18: 2018 International Conference on Supercomputing. ACM, 2018. http://dx.doi.org/10.1145/3205289.3205306.

2. Ahmed, N., N. Mateev, and K. Pingali. "Tiling Imperfectly-nested Loop Nests." In ACM/IEEE SC 2000 Conference. IEEE, 2000. http://dx.doi.org/10.1109/sc.2000.10018.

3. Bao, Bin, and Chen Ding. "Defensive loop tiling for shared cache." In 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2013. http://dx.doi.org/10.1109/cgo.2013.6495008.

4. Lin, Haibo, Tao Liu, Lakshminarayanan Renganarayana, et al. "Automatic Loop Tiling for Direct Memory Access." In 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE, 2011. http://dx.doi.org/10.1109/ipdps.2011.53.

5. Griebl, Martin. "On tiling space-time mapped loop nests." In Proceedings of the Thirteenth Annual ACM Symposium. ACM Press, 2001. http://dx.doi.org/10.1145/378580.378740.

6. Bao, Bin, and Xiaoya Xiang. "Defensive loop tiling for multi-core processor." In Proceedings of the 2012 ACM SIGPLAN Workshop. ACM Press, 2012. http://dx.doi.org/10.1145/2247684.2247701.

7. Xue, Jingling. "Time-Minimal and Processor-Time-Minimal Loop Tiling." In Proceedings of the 4th International Conference. World Scientific, 2000. http://dx.doi.org/10.1142/9789812792037_0024.

8. Strout, Michelle Mills, Fabio Luporini, Christopher D. Krieger, et al. "Generalizing Run-Time Tiling with the Loop Chain Abstraction." In 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE, 2014. http://dx.doi.org/10.1109/ipdps.2014.118.

9. Hammami, Emna, and Yosr Slama. "An Overview on Loop Tiling Techniques for Code Generation." In 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA). IEEE, 2017. http://dx.doi.org/10.1109/aiccsa.2017.168.

10. Wu, Mingchuan, Ying Liu, Huimin Cui, et al. "Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory." In PACT '20: International Conference on Parallel Architectures and Compilation Techniques. ACM, 2020. http://dx.doi.org/10.1145/3410463.3414637.