Academic literature on the topic 'Memory-Intensive Computation'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Memory-Intensive Computation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Memory-Intensive Computation"

1

Oskin, Mark, Diana Keen, Justin Hensley, Lucian-Vlad Lita, and Frederic T. Chong. "Operating Systems Techniques for Parallel Computation in Intelligent Memory." Parallel Processing Letters 12, no. 03n04 (September 2002): 311–26. http://dx.doi.org/10.1142/s0129626402001014.

Full text
Abstract:
Advances in DRAM density have led to several proposals to perform computation in memory [1] [2] [3]. Active Pages is a page-based model of intelligent memory that can exploit large amounts of parallel computation in data-intensive applications. With a simple VLIW processor embedded near each page on DRAM, Active Page memory systems achieve up to 1000X speedups over conventional memory systems [4]. Active Pages are specifically designed to support virtualized hardware resources. In this study, we examine operating system techniques that allow Active Page memories to share, or multiplex, embedded VLIW processors across multiple physical Active Pages. We explore the trade-off between individual page-processor performance and page-level multiplexing. We find that hardware costs of computational logic can be reduced from 31% of DRAM chip area to 12%, through multiplexing, without significant loss in performance. Furthermore, manufacturing defects that disable up to 50% of the page processors can be tolerated through efficient resource allocation and associative multiplexing.
2

Meena, V., Obulaporam Gireesha, Kannan Krithivasan, and V. S. Shankar Sriram. "Fuzzy simplified swarm optimization for multisite computational offloading in mobile cloud computing." Journal of Intelligent & Fuzzy Systems 39, no. 6 (December 4, 2020): 8285–97. http://dx.doi.org/10.3233/jifs-189148.

Full text
Abstract:
Mobile Cloud Computing (MCC)'s rapid technological advancements facilitate various computation-intensive applications on smart mobile devices. However, such applications are constrained by the limited processing power, energy consumption, and storage capacity of smart mobile devices. To mitigate these issues, computational offloading is found to be one of the most promising techniques, as it offloads the execution of computation-intensive applications to cloud resources. In addition, various kinds of cloud services and resourceful servers are available to offload computationally intensive tasks. However, their processing speeds, access delays, computation capabilities, residual memory, and service charges differ, which hinders their usage, as offloading decisions become time-consuming and ambiguous. To address the aforementioned issues, this paper presents a Fuzzy Simplified Swarm Optimization based cloud Computational Offloading (FSSOCO) algorithm to achieve optimum multisite offloading. Fuzzy logic and simplified swarm optimization are employed for the identification of powerful nodes and for task decomposition, respectively. The overall performance of FSSOCO is validated using the SPECjvm benchmark suite and compared with state-of-the-art offloading techniques in terms of weighted total cost, energy consumption, and processing time.
3

Ahuja, Sanjay P., and Jesus Zambrano. "Mobile Cloud Computing: Offloading Mobile Processing to the Cloud." Computer and Information Science 9, no. 1 (January 31, 2016): 90. http://dx.doi.org/10.5539/cis.v9n1p90.

Full text
Abstract:
The current proliferation of mobile systems, such as smartphones and tablets, has led to their adoption as the primary computing platforms for many users. This trend suggests that designers will continue to aim towards the convergence of functionality on a single mobile device (such as phone + mp3 player + camera + Web browser + GPS + mobile apps + sensors). However, this convergence penalizes the mobile system with respect to computational resources such as processor speed, memory consumption, and disk capacity, as well as weight, size, ergonomics, and the component most important to users: battery life. Therefore, energy consumption and response time are major concerns when executing complex algorithms on mobile devices because they require significant resources to solve intricate problems.

Offloading mobile processing is an excellent solution to augment mobile capabilities by migrating computation to powerful infrastructures. Current cloud computing environments for performing complex and data-intensive computation remotely are likely to be an excellent solution for offloading computation and data processing from mobile devices restricted by reduced resources. This research uses cloud computing as the processing platform for computation-intensive workloads while measuring energy consumption and response times on a Samsung Galaxy S5 mobile phone running Android 4.1 OS.
4

Allouche, Mohamed, Tarek Frikha, Mihai Mitrea, Gérard Memmi, and Faten Chaabane. "Lightweight Blockchain Processing. Case Study: Scanned Document Tracking on Tezos Blockchain." Applied Sciences 11, no. 15 (August 3, 2021): 7169. http://dx.doi.org/10.3390/app11157169.

Full text
Abstract:
To bridge the current gap between Blockchain expectations and their intensive computation constraints, the present paper advances a lightweight processing solution, based on a load-balancing architecture, compatible with lightweight/embedded processing paradigms. In this way, the execution of complex operations is securely delegated to an off-chain general-purpose computing machine while the intimate Blockchain operations are kept on-chain. The illustrations correspond to an on-chain Tezos configuration and to a multiprocessor ARM embedded platform (integrated into a Raspberry Pi). Performance is assessed in terms of security, execution time, and CPU consumption when carrying out a visual document fingerprinting task. It is thus demonstrated that the advanced solution makes it possible for a computation-intensive application to be deployed under severely constrained computation and memory resources, as set by a Raspberry Pi 3. The experimental results show that up to nine Tezos nodes can be deployed on a single Raspberry Pi 3 and that the limitation derives not from the memory but from the computation resources. The execution time with a limited number of fingerprints is 40% higher than with a classical PC solution (value computed with 95% relative error lower than 5%).
5

Du, Liu-Ge, Kang Li, Fan-Min Kong, and Yuan Hu. "Parallel 3D Finite-Difference Time-Domain Method on Multi-GPU Systems." International Journal of Modern Physics C 22, no. 02 (February 2011): 107–21. http://dx.doi.org/10.1142/s012918311101618x.

Full text
Abstract:
Finite-difference time-domain (FDTD) is a popular but computationally intensive method for solving Maxwell's equations in the simulation of electrical and optical devices. This paper presents implementations of three-dimensional FDTD with convolutional perfectly matched layer (CPML) absorbing boundary conditions on the graphics processing unit (GPU). Electromagnetic fields in Yee cells are calculated in parallel by millions of threads arranged as a grid of blocks with the compute unified device architecture (CUDA) programming model, and considerable speedup factors are obtained versus sequential CPU code. We extend the parallel algorithm to multiple GPUs in order to solve electrically large structures. An asynchronous memory copy scheme is used in the data exchange procedure to improve computation efficiency. We successfully use this technique to simulate pointwise source radiation and validate the result by comparison to a high-precision computation, which shows favorable agreement. With four commodity GTX295 graphics cards in a single personal computer, more than 4000 million Yee cells can be updated in one second, which is hundreds of times faster than traditional CPU computation.
6

Wang, Wei, Yiyang Hu, Ting Zou, Hongmei Liu, Jin Wang, and Xin Wang. "A New Image Classification Approach via Improved MobileNet Models with Local Receptive Field Expansion in Shallow Layers." Computational Intelligence and Neuroscience 2020 (August 1, 2020): 1–10. http://dx.doi.org/10.1155/2020/8817849.

Full text
Abstract:
Because deep neural networks (DNNs) are both memory-intensive and computation-intensive, they are difficult to apply to embedded systems with limited hardware resources. Therefore, DNN models need to be compressed and accelerated. By applying depthwise separable convolutions, MobileNet can decrease the number of parameters and the computational complexity with little loss of classification precision. Based on MobileNet, three improved MobileNet models with local receptive field expansion in shallow layers, also called Dilated-MobileNet (Dilated Convolution MobileNet) models, are proposed, in which dilated convolutions are introduced into a specific convolutional layer of the MobileNet model. Without increasing the number of parameters, dilated convolutions are used to increase the receptive field of the convolution filters to obtain better classification accuracy. The experiments were performed on the Caltech-101, Caltech-256, and Tübingen Animals with Attributes datasets. The results show that Dilated-MobileNets can obtain up to 2% higher classification accuracy than MobileNet.
7

Xu, Shilin, and Caili Guo. "Computation Offloading in a Cognitive Vehicular Networks with Vehicular Cloud Computing and Remote Cloud Computing." Sensors 20, no. 23 (November 29, 2020): 6820. http://dx.doi.org/10.3390/s20236820.

Full text
Abstract:
To satisfy the explosive growth of computation-intensive vehicular applications, we investigated the computation offloading problem in a cognitive vehicular network (CVN). Specifically, in our scheme, vehicular cloud computing (VCC)- and remote cloud computing (RCC)-enabled computation offloading were jointly considered. So far, extensive research has been conducted on RCC-based computation offloading, while studies on VCC-based computation offloading are relatively rare. In fact, due to the dynamics and uncertainty of on-board resources, VCC-based computation offloading is more challenging than the RCC one, especially in vehicular scenarios with expensive inter-vehicle communication or a poor communication environment. To solve this problem, we propose to leverage the VCC's computation resources for computation offloading in a perception-exploitation way, which mainly comprises two stages: resource discovery and computation offloading. In the resource discovery stage, based on the action-observation history, a Long Short-Term Memory (LSTM) model is proposed to predict the on-board resource utilization status at the next time slot. Thereafter, based on the obtained computation resource distribution, a decentralized multi-agent Deep Reinforcement Learning (DRL) algorithm is proposed to solve the collaborative computation offloading with VCC and RCC. Finally, the proposed algorithms' effectiveness is verified with a host of numerical simulation results from different perspectives.
8

Yin, Lujia, Yiming Zhang, Zhaoning Zhang, Yuxing Peng, and Peng Zhao. "ParaX." Proceedings of the VLDB Endowment 14, no. 6 (February 2021): 864–77. http://dx.doi.org/10.14778/3447689.3447692.

Full text
Abstract:
Despite the fact that GPUs and accelerators are more efficient in deep learning (DL), commercial clouds like Facebook and Amazon now heavily use CPUs in DL computation because there are large numbers of CPUs which would otherwise sit idle during off-peak periods. Following the trend, CPU vendors have not only released high-performance many-core CPUs but also developed efficient math kernel libraries. However, current DL platforms cannot scale well to a large number of CPU cores, making many-core CPUs inefficient in DL computation. We analyze the memory access patterns of various layers and identify the root cause of the low scalability, i.e., the per-layer barriers that are implicitly imposed by current platforms which assign one single instance (i.e., one batch of input data) to a CPU. The barriers cause severe memory bandwidth contention and CPU starvation in the access-intensive layers (like activation and BN). This paper presents a novel approach called ParaX, which boosts the performance of DL on many-core CPUs by effectively alleviating bandwidth contention and CPU starvation. Our key idea is to assign one instance to each CPU core instead of to the entire CPU, so as to remove the per-layer barriers on the executions of the many cores. ParaX designs an ultralight scheduling policy which sufficiently overlaps the access-intensive layers with the compute-intensive ones to avoid contention, and proposes a NUMA-aware gradient server mechanism for training which leverages shared memory to substantially reduce the overhead of per-iteration parameter synchronization. We have implemented ParaX on MXNet. Extensive evaluation on a two-NUMA Intel 8280 CPU shows that ParaX significantly improves the training/inference throughput for all tested models (for image recognition and natural language processing) by 1.73X ~ 2.93X.
9

Alava, Pallavi, and G. Radhika. "Robust and Secure Framework for Mobile Cloud Computing." Asian Journal of Computer Science and Technology 8, S3 (June 5, 2019): 1–6. http://dx.doi.org/10.51983/ajcst-2019.8.s3.2115.

Full text
Abstract:
Smartphone devices are widely used in our daily lives. However, these devices exhibit limitations, such as short battery lifetime, limited computation power, small memory size, and unpredictable network connectivity. Therefore, various solutions have been proposed to mitigate these limitations and extend battery lifetime through the use of offloading techniques. In this paper, a novel framework is proposed to offload intensive computation tasks from the mobile device to the cloud. This framework uses an optimization model to determine the offloading decision dynamically based on four main parameters, namely, energy consumption, CPU utilization, execution time, and memory usage. Additionally, a new security layer is provided to protect the transferred data in the cloud from any attack. The experimental results showed that the framework can choose a suitable offloading decision for different types of mobile application tasks while achieving significant performance improvement. Moreover, unlike previous techniques, the framework can protect application data from any threat.
10

Piao, Yongri, Zhengkun Rong, Miao Zhang, and Huchuan Lu. "Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11865–73. http://dx.doi.org/10.1609/aaai.v34i07.6860.

Full text
Abstract:
Light field saliency detection has attracted increasing interest in recent years due to the significant improvements in challenging scenes enabled by abundant light field cues. However, the high dimensionality of light field data poses computation-intensive and memory-intensive challenges, and light field data access is far less ubiquitous than that of RGB data. These issues may severely impede practical applications of light field saliency detection. In this paper, we introduce an asymmetrical two-stream architecture inspired by knowledge distillation to confront these challenges. First, we design a teacher network that learns to exploit focal slices for higher requirements on desktop computers and meanwhile transfers comprehensive focusness knowledge to the student network. Our teacher network relies on two tailor-made modules, namely the multi-focusness recruiting module (MFRM) and the multi-focusness screening module (MFSM). Second, we propose two distillation schemes to train a student network towards memory and computation efficiency while ensuring performance. The proposed distillation schemes ensure better absorption of focusness knowledge and enable the student to replace the focal slices with a single RGB image in a user-friendly way. We conduct experiments on three benchmark datasets and demonstrate that our teacher network achieves state-of-the-art performance and the student network (ResNet18) achieves Top-1 accuracy on the HFUT-LFSD dataset and Top-4 on DUT-LFSD, reducing the model size by 56% and boosting frames per second (FPS) by 159% compared with the best performing method.

Dissertations / Theses on the topic "Memory-Intensive Computation"

1

Teng, Sin Yong. "Intelligent Energy-Savings and Process Improvement Strategies in Energy-Intensive Industries." Doctoral thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2020. http://www.nusl.cz/ntk/nusl-433427.

Full text
Abstract:
As new technologies for energy-intensive industries continue to evolve, existing facilities gradually fall behind in efficiency and productivity. Fierce market competition and environmental legislation force these traditional facilities to cease operations and shut down. Process improvement and retrofit projects are essential to maintaining the operational performance of such facilities. Current approaches to process improvement are mainly process integration, process optimization, and process intensification; these areas generally rely on mathematical optimization, the practitioner's experience, and operational heuristics, and they serve as the foundation for process improvement. However, their performance can be further enhanced by modern computational intelligence. The purpose of this thesis is therefore to apply advanced artificial intelligence and machine learning techniques to process improvement in energy-intensive industrial processes. This work addresses the problem by simulating industrial systems and contributes the following: (i) application of machine learning techniques, including one-shot learning and neuro-evolution, for data-driven modeling and optimization of individual units; (ii) application of dimensionality reduction (e.g., principal component analysis, autoencoders) for multi-objective optimization of multi-unit processes; (iii) design of a new tool for analyzing problematic parts of a system so that they can be eliminated (bottleneck tree analysis, BOTA), together with a proposed extension that handles multidimensional problems using a data-driven approach; (iv) demonstration of the effectiveness of Monte Carlo simulation, neural networks, and decision trees for decision-making when integrating new process technology into existing processes; (v) comparison of Hierarchical Temporal Memory (HTM) and dual optimization with several predictive tools for supporting real-time operations management; (vi) implementation of an artificial neural network within an interface for the conventional process graph (P-graph); and (vii) highlighting the future of artificial intelligence and process engineering in biosystems through a commercially based multi-omics paradigm.
2

Mirza, Salma. "Scalable, Memory-Intensive Scientific Computing on Field Programmable Gate Arrays." 2010. https://scholarworks.umass.edu/theses/404.

Full text
Abstract:
Cache-based, general purpose CPUs perform at a small fraction of their maximum floating point performance when executing memory-intensive simulations, such as those required for many scientific computing problems. This is due to the memory bottleneck that is encountered with large arrays that must be stored in dynamic RAM. A system of FPGAs with a large enough memory bandwidth, clocked at only hundreds of MHz, can outperform a CPU clocked at GHz in terms of floating point performance. An FPGA core designed for a target performance that does not unnecessarily exceed the memory-imposed bottleneck can then be distributed, along with multiple memory interfaces, into a scalable architecture that overcomes the bandwidth limitation of a single interface. Interconnected cores can work together to solve a scientific computing problem and exploit a bandwidth that is the sum of the bandwidth available from all of their connected memory interfaces. The implementation demonstrates this concept of scalability with two memory interfaces through the use of available FPGA prototyping platforms. Even though the FPGAs operate at 133 MHz, which is twenty-one times slower than an AMD Phenom X4 processor operating at 2.8 GHz, the system of two FPGAs performs eight times slower than the processor for the example problem of sparse matrix-vector multiplication (SMVM) in heat transfer. However, the system is demonstrated to be scalable, with a run-time that decreases linearly with respect to the available memory bandwidth. The floating point performance of a single-board implementation is 12 GFlops, which doubles to 24 GFlops for a two-board implementation, for a gather or scatter operation on matrices of varying sizes.
3

Lin, Yi-Neng, and 林義能. "Resource Allocation in Multithreaded Multiprocessor Network Processors for Computational Intensive and Memory Access Intensive Network Applications." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/59347016529789938918.

Full text
Abstract:
Doctoral dissertation
National Chiao Tung University
Institute of Computer Science and Engineering
ROC academic year 95 (2006–2007)
Networking applications today demand a hardware platform with stronger computational or memory access capabilities as well as the ability to adapt efficiently to changes in protocols or product specifications. As the usual options, however, neither a general-purpose processor architecture, which is usually slowed down by kernel-user space communications and context switches, nor an ASIC, which lacks flexibility and requires a long development period, measures up. In this thesis, we discuss (1) the feasibility of applying the emerging alternative, network processors, which feature a multithreaded multiprocessor architecture, rich resources, minor context switch overhead, and flexibility, to solve the problem, and (2) ways of exploiting those resources when dealing with applications of different computational and memory access requirements. We start by surveying network processors, which are then categorized into two types: coprocessor-centric and core-centric. In the former, the coprocessors take care of data plane manipulation, whose load is usually much heavier than that of the control plane, while in the latter the core processor handles most of the packet processing, including the control plane and data plane. After that, we evaluate real implementations of computationally intensive and memory access intensive applications over the coprocessor-centric and core-centric platforms, respectively, aiming to unveil the bottlenecks of the implementations as well as the allocation measures. Finally, based on the evaluations, analytical models are formalized and simulation environments are built to observe possible design implications for these two types of network processors.

Book chapters on the topic "Memory-Intensive Computation"

1

Williams, Samuel, Kaushik Datta, Leonid Oliker, Jonathan Carter, John Shalf, and Katherine Yelick. "Auto-Tuning Memory-Intensive Kernels for Multicore." In Chapman & Hall/CRC Computational Science, 273–96. CRC Press, 2010. http://dx.doi.org/10.1201/b10509-14.

Full text

Conference papers on the topic "Memory-Intensive Computation"

1

Hamdioui, Said. "Computation in Memory for Data-Intensive Applications." In SCOPES '15: 18th International Workshop on Software and Compilers for Embedded Systems. New York, NY, USA: ACM, 2015. http://dx.doi.org/10.1145/2764967.2771820.

Full text
2

Hamdioui, Said, Lei Xie, Anh Nguyen Hai Anh, Mottaqiallah Taouil, Koen Bertels, Henk Corporaal, Hailong Jiao, et al. "Memristor Based Computation-in-Memory Architecture for Data-Intensive Applications." In Design, Automation and Test in Europe. New Jersey: IEEE Conference Publications, 2015. http://dx.doi.org/10.7873/date.2015.1136.

Full text
3

Al-Absi, Ahmed Abdulhakim, and Dae-Ki Kang. "A Novel Parallel Computation Model with Efficient Local Memory Management for Data-Intensive Applications." In 2015 IEEE 8th International Conference on Cloud Computing (CLOUD). IEEE, 2015. http://dx.doi.org/10.1109/cloud.2015.150.

Full text
4

Stoller, Daniel, Mi Tian, Sebastian Ewert, and Simon Dixon. "Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/400.

Full text
Abstract:
Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. While their receptive field grows exponentially with the number of layers, computing the convolutions over very long sequences of features in each layer is time and memory-intensive, and prohibits the use of longer receptive fields in practice. To increase efficiency, we make use of the "slow feature" hypothesis stating that many features of interest are slowly varying over time. For this, we use a U-Net architecture that computes features at multiple time-scales and adapt it to our auto-regressive scenario by making convolutions causal. We apply our model ("Seq-U-Net") to a variety of tasks including language and audio generation. In comparison to TCN and Wavenet, our network consistently saves memory and computation time, with speed-ups for training and inference of over 4x in the audio generation experiment in particular, while achieving a comparable performance on real-world tasks.
5

Acharya, Anurag, and Sanjeev Setia. "Using idle memory for data-intensive computations (extended abstract)." In the 1998 ACM SIGMETRICS joint international conference. New York, New York, USA: ACM Press, 1998. http://dx.doi.org/10.1145/277851.277949.

Full text
6

Pooja and Asmita Pandey. "Impact of memory intensive applications on performance of cloud virtual machine." In 2014 Recent Advances in Engineering and Computational Sciences (RAECS). IEEE, 2014. http://dx.doi.org/10.1109/raecs.2014.6799629.

Full text
7

Brandalero, Marcelo, and Antonio Carlos Beck. "MuTARe: A Multi-Target, Adaptive Reconfigurable Architecture." In XX Simpósio em Sistemas Computacionais de Alto Desempenho. Sociedade Brasileira de Computação - SBC, 2019. http://dx.doi.org/10.5753/wscad_estendido.2019.8706.

Full text
Abstract:
Power consumption, earlier a design constraint only in embedded systems, has become the major driver for architectural optimizations in all domains, from the cloud to the edge. Application-specific accelerators provide a low-power processing solution by efficiently matching the hardware to the application; however, since in many domains the hardware must efficiently execute a broad range of fast-evolving applications, unpredictable at design time and each with distinct resource requirements, alternative approaches are required. Besides that, the same hardware must also adapt its computational power at run time to the system status and workload sizes. To address these issues, this thesis presents a general-purpose reconfigurable accelerator that can be coupled to a heterogeneous set of cores and supports Dynamic Voltage and Frequency Scaling (DVFS), synergistically combining the techniques for a better match between different applications and hardware when compared to current designs. The resulting architecture, MuTARe, provides a coarse-grained, regular, and reconfigurable structure which is suitable for automatic acceleration of deployed code through dynamic binary translation. Beyond that, the structure of MuTARe is further leveraged to apply two emerging computing paradigms that can boost power efficiency: Near-Threshold Voltage (NTV) computing (while still supporting transparent acceleration) and Approximate Computing (AxC). Compared to a traditional heterogeneous system with DVFS support, the base MuTARe architecture can automatically improve the execution time by up to 1.3×, or adapt to the same task deadline with 1.6× smaller energy consumption, or adapt to the same low energy budget with 2.3× better performance. In NTV mode, MuTARe can transparently save a further 30% of energy in memory-intensive workloads by operating the combinatorial datapath at half the memory frequency. In AxC mode, MuTARe can further improve power savings by up to 50% by leveraging approximate functional units for arithmetic computations.
8

Mittal, Anshul, Sameera D. Wijeyakulasuriya, Dan Probst, Siddhartha Banerjee, Charles E. A. Finney, K. Dean Edwards, Michael Willcox, and Clayton Naber. "Multi-Dimensional Computational Combustion of Highly Dilute, Premixed Spark-Ignited Opposed-Piston Gasoline Engine Using Direct Chemistry With a New Primary Reference Fuel Mechanism." In ASME 2017 Internal Combustion Engine Division Fall Technical Conference. American Society of Mechanical Engineers, 2017. http://dx.doi.org/10.1115/icef2017-3618.

Full text
Abstract:
This work presents a modeling approach for multidimensional combustion simulations of a highly dilute, opposed-piston, spark-ignited gasoline engine. Detailed chemical kinetics is used to model combustion, with no sub-grid correction for reaction rates based on the turbulent fluctuations of temperature and species mass fractions. Turbulence is modeled using the RNG k-ε model, and the RANS length scales are resolved efficiently by the use of automatic mesh refinement when and where the flow parameter curvature (second derivative) is large. The laminar flame is thickened by the RANS viscosity and a constant turbulent Schmidt (Sc) number, and a refined mesh (sufficient to resolve the thickened turbulent flame) is used to obtain accurate predictions of turbulent flame speeds. An accurate chemical kinetics mechanism is required to model flame kinetics and fuel burn rates under the conditions of interest. For practical computational fluid dynamics applications, the use of large detailed chemistry mechanisms with thousands of species is both costly and memory-intensive. For this reason, skeletal mechanisms with a lower number of species (typically ∼100), reduced under specific operating conditions, are often used. In this work, a new primary reference fuel chemical mechanism is developed to better correlate with laminar flame speed data relevant to highly dilute engine conditions. Simulations are carried out in a dilute gasoline engine with an opposed-piston architecture, and results are presented across various dilution conditions.
9

Xia, Zhaohui, Qifu Wang, Yunbao Huang, Wei Yixiong, and Wang Yingjun. "Parallel Strategy of FMBEM for 3D Elastostatics and its GPU Implementation Using CUDA." In ASME 2014 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. American Society of Mechanical Engineers, 2014. http://dx.doi.org/10.1115/detc2014-34587.

Abstract:
The Finite Element Method (FEM) is pervasive in 3D product design analysis, where Computer Aided Design (CAD) models must first be converted into mesh models and then enriched with material properties, boundary conditions, and other data; the interaction between CAD models and FEM models is intensive. The Boundary Element Method (BEM) has been expected to be advantageous for large-scale problems in recent years owing to its reduction of the problem dimensionality and its simpler mesh generation. However, BEM applications have so far been limited to relatively small problems because both the memory and the computational complexity of matrix buildup are O(N²). The fast multipole BEM (FMBEM), which combines BEM with the fast multipole method (FMM), overcomes this defect of the traditional BEM and provides an effective way to solve large-scale problems. Combining GPU parallel computing with the FMBEM can further improve its efficiency significantly. For three-dimensional elasticity problems, the parallelism of the multipole moment (ME) computation, the multipole-to-multipole (M2M) translation, the multipole-to-local (M2L) translation, the local-to-local (L2L) translation, and the near-field direct calculation is analyzed according to the characteristics of the FMM, and the corresponding parallel strategies under CUDA are presented in this paper. Three major parts are included herein: (1) FMBEM theory in 3D elastostatics, (2) the parallel FMBEM algorithm using CUDA, and (3) comparisons of the GPU-parallel FMBEM with BEM, FEM, and serial FMBEM on engineering examples. Numerical results show the 3D elastostatics GPU FMBEM not only speeds up the boundary element calculation process but also saves memory, making it effective for large-scale engineering problems.
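The scaling contrast in this abstract, O(N²) direct evaluation versus a multipole-style near-field/far-field split, can be illustrated with a minimal single-level sketch. This is a hypothetical 1D 1/r kernel with uniform cells and a lowest-order (monopole) far-field expansion; the authors' FMBEM is multi-level, uses higher-order expansions with M2M/M2L/L2L translations, and solves 3D elastostatics, none of which is reproduced here:

```python
def direct_sum(targets, sources, charges):
    """O(N^2) direct evaluation of a 1/r-type kernel, the analogue of
    building and applying the dense BEM matrix."""
    out = []
    for x in targets:
        s = 0.0
        for y, q in zip(sources, charges):
            if x != y:
                s += q / abs(x - y)
        out.append(s)
    return out

def cell_aggregated_sum(targets, sources, charges, ncells, lo, hi):
    """Single-level far-field approximation: sources in non-neighbor
    cells are lumped into one monopole (total charge at the centroid),
    a lowest-order analogue of the multipole moment (ME) step, while
    neighbor cells are handled by near-field direct calculation."""
    width = (hi - lo) / ncells
    cells = [[] for _ in range(ncells)]
    for y, q in zip(sources, charges):
        cells[min(int((y - lo) / width), ncells - 1)].append((y, q))
    # Monopole moment of each cell: total charge and charge-weighted centroid.
    moments = []
    for cell in cells:
        Q = sum(q for _, q in cell)
        c = sum(y * q for y, q in cell) / Q if Q else 0.0
        moments.append((Q, c))
    out = []
    for x in targets:
        i = min(int((x - lo) / width), ncells - 1)
        s = 0.0
        for j in range(ncells):
            if abs(i - j) <= 1:          # near field: direct pairwise sum
                for y, q in cells[j]:
                    if x != y:
                        s += q / abs(x - y)
            else:                        # far field: one monopole per cell
                Q, c = moments[j]
                if Q:
                    s += Q / abs(x - c)
        out.append(s)
    return out
```

Per target, the aggregated version touches only the few near-field sources plus one moment per far cell, which is what drops the cost below O(N²); a real FMM makes the far-field pass hierarchical to reach O(N) or O(N log N).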
10

Vance, Marion W., and Kyle D. Squires. "An Approach to Parallel Computing in an Eulerian-Lagrangian Two-Phase Flow Model." In ASME 2002 Joint U.S.-European Fluids Engineering Division Conference. ASMEDC, 2002. http://dx.doi.org/10.1115/fedsm2002-31225.

Abstract:
An approach to parallel solution of an Eulerian-Lagrangian model of dilute gas-solid flows is presented. With Lagrangian treatments of the dispersed phase, one of the principal computational challenges arises in models that account for inter-particle interactions; deterministic treatment of particle-particle collisions poses the most computationally intensive aspect of the present simulation. Simple searches lead to algorithms whose cost is O(Np²), where Np is the particle population. The approach developed in the current effort localizes collision-detection neighborhoods using a cell-index method and spatially distributes those neighborhoods for parallel solution. The method is evaluated using simulations of gas-solid turbulent flow in a vertical channel. The instantaneous position and velocity of each particle are obtained by solving the equation of motion for a small rigid sphere, assuming the force induced by the fluid reduces to the drag contribution. Binary particle collisions without energy dissipation or inter-particle friction are considered. The carrier flow is computed using Large Eddy Simulation of the incompressible Navier-Stokes equations. The entire dispersed-phase population is partitioned via static spatial decomposition of the domain to maximize parallel efficiency. Simulations on small numbers of distributed-memory processors show linear speedup in the collision-detection step and nearly optimal reductions in total simulation time.
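The cell-index idea this abstract contrasts with O(Np²) all-pairs search can be sketched as follows. This is a minimal serial Python illustration with invented names, not the paper's distributed, LES-coupled implementation; it assumes equal-radius spheres and a `cell_size` of at least one particle diameter, so every overlapping pair lies in the same or a neighboring cell:

```python
def _overlap(p, q, radius):
    """True if two equal-radius spheres at points p and q intersect."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return d2 < (2.0 * radius) ** 2

def collision_pairs_direct(positions, radius):
    """O(Np^2) reference: test every particle against every other."""
    pairs = set()
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if _overlap(positions[i], positions[j], radius):
                pairs.add((i, j))
    return pairs

def collision_pairs_cell_index(positions, radius, cell_size):
    """Cell-index method: hash particles into cells, then test only
    particles in the same or a neighboring cell."""
    grid = {}
    for idx, (x, y, z) in enumerate(positions):
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        grid.setdefault(key, []).append(idx)
    pairs = set()
    for (cx, cy, cz), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        for i in members:
                            if i < j and _overlap(positions[i], positions[j], radius):
                                pairs.add((i, j))
    return pairs
```

For roughly uniform particle distributions each cell holds O(1) particles, so the work per particle is bounded and the total cost becomes near-linear in Np; the cells also give natural units to distribute across processors, as in the paper's static spatial decomposition.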