
Journal articles on the topic 'Memory-Intensive Computation'



Consult the top 50 journal articles for your research on the topic 'Memory-Intensive Computation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Oskin, Mark, Diana Keen, Justin Hensley, Lucian-Vlad Lita, and Frederic T. Chong. "Operating Systems Techniques for Parallel Computation in Intelligent Memory." Parallel Processing Letters 12, no. 03n04 (September 2002): 311–26. http://dx.doi.org/10.1142/s0129626402001014.

Abstract:
Advances in DRAM density have led to several proposals to perform computation in memory [1] [2] [3]. Active Pages is a page-based model of intelligent memory that can exploit large amounts of parallel computation in data-intensive applications. With a simple VLIW processor embedded near each page on DRAM, Active Page memory systems achieve up to 1000X speedups over conventional memory systems [4]. Active Pages are specifically designed to support virtualized hardware resources. In this study, we examine operating system techniques that allow Active Page memories to share, or multiplex, embedded VLIW processors across multiple physical Active Pages. We explore the trade-off between individual page-processor performance and page-level multiplexing. We find that hardware costs of computational logic can be reduced from 31% of DRAM chip area to 12%, through multiplexing, without significant loss in performance. Furthermore, manufacturing defects that disable up to 50% of the page processors can be tolerated through efficient resource allocation and associative multiplexing.
2

Meena, V., Obulaporam Gireesha, Kannan Krithivasan, and V. S. Shankar Sriram. "Fuzzy simplified swarm optimization for multisite computational offloading in mobile cloud computing." Journal of Intelligent & Fuzzy Systems 39, no. 6 (December 4, 2020): 8285–97. http://dx.doi.org/10.3233/jifs-189148.

Abstract:
Mobile Cloud Computing (MCC)'s rapid technological advancements facilitate various computation-intensive applications on smart mobile devices. However, such applications are constrained by the limited processing power, energy, and storage capacity of smart mobile devices. To mitigate these issues, computational offloading has proved to be one of the most promising techniques, as it offloads the execution of computation-intensive applications to cloud resources. In addition, various kinds of cloud services and resourceful servers are available to offload computationally intensive tasks. However, their processing speeds, access delays, computation capabilities, residual memory, and service charges differ, which hampers their use, as offloading decisions become time-consuming and ambiguous. To address these issues, this paper presents a Fuzzy Simplified Swarm Optimization based cloud Computational Offloading (FSSOCO) algorithm to achieve optimum multisite offloading. Fuzzy logic and simplified swarm optimization are employed for the identification of high-powered nodes and for task decomposition, respectively. The overall performance of FSSOCO is validated using the SPECjvm benchmark suite and compared with state-of-the-art offloading techniques in terms of weighted total cost, energy consumption, and processing time.
3

Ahuja, Sanjay P., and Jesus Zambrano. "Mobile Cloud Computing: Offloading Mobile Processing to the Cloud." Computer and Information Science 9, no. 1 (January 31, 2016): 90. http://dx.doi.org/10.5539/cis.v9n1p90.

Abstract:
The current proliferation of mobile systems, such as smartphones and tablets, has led to their adoption as the primary computing platforms for many users. This trend suggests that designers will continue to aim for the convergence of functionality on a single mobile device (such as phone + mp3 player + camera + Web browser + GPS + mobile apps + sensors). However, this convergence penalizes the mobile system with respect to computational resources such as processor speed, memory consumption, and disk capacity, as well as weight, size, ergonomics, and the component most important to users: battery life. Therefore, energy consumption and response time are major concerns when executing complex algorithms on mobile devices, because such algorithms require significant resources to solve intricate problems. Offloading mobile processing is an excellent way to augment mobile capabilities by migrating computation to powerful infrastructures. Current cloud computing environments for performing complex and data-intensive computation remotely are likely to be an excellent solution for offloading computation and data processing from mobile devices restricted by reduced resources. This research uses cloud computing as the processing platform for computation-intensive workloads while measuring energy consumption and response times on a Samsung Galaxy S5 mobile phone running Android 4.1 OS.
4

Allouche, Mohamed, Tarek Frikha, Mihai Mitrea, Gérard Memmi, and Faten Chaabane. "Lightweight Blockchain Processing. Case Study: Scanned Document Tracking on Tezos Blockchain." Applied Sciences 11, no. 15 (August 3, 2021): 7169. http://dx.doi.org/10.3390/app11157169.

Abstract:
To bridge the current gap between Blockchain expectations and their intensive computation constraints, the present paper advances a lightweight processing solution, based on a load-balancing architecture, compatible with lightweight/embedded processing paradigms. In this way, the execution of complex operations is securely delegated to an off-chain general-purpose computing machine while the intimate Blockchain operations are kept on-chain. The illustrations correspond to an on-chain Tezos configuration and to a multiprocessor ARM embedded platform (integrated into a Raspberry Pi). Performance is assessed in terms of security, execution time, and CPU consumption when carrying out a visual document fingerprinting task. It is thus demonstrated that the advanced solution makes it possible for a computation-intensive application to be deployed under severely constrained computation and memory resources, as set by a Raspberry Pi 3. The experimental results show that up to nine Tezos nodes can be deployed on a single Raspberry Pi 3 and that the limitation derives not from memory but from computation resources. The execution time with a limited number of fingerprints is 40% higher than with a classical PC solution (value computed with 95% relative error lower than 5%).
5

Du, Liu-Ge, Kang Li, Fan-Min Kong, and Yuan Hu. "Parallel 3D Finite-Difference Time-Domain Method on Multi-GPU Systems." International Journal of Modern Physics C 22, no. 02 (February 2011): 107–21. http://dx.doi.org/10.1142/s012918311101618x.

Abstract:
Finite-difference time-domain (FDTD) is a popular but computationally intensive method of solving Maxwell's equations for the simulation of electrical and optical devices. This paper presents implementations of three-dimensional FDTD with convolutional perfectly matched layer (CPML) absorbing boundary conditions on the graphics processing unit (GPU). Electromagnetic fields in Yee cells are calculated in parallel by millions of threads, arranged as a grid of blocks under the compute unified device architecture (CUDA) programming model, and considerable speedup factors are obtained over sequential CPU code. We extend the parallel algorithm to multiple GPUs in order to solve electrically large structures. An asynchronous memory copy scheme is used in the data exchange procedure to improve computation efficiency. We successfully use this technique to simulate pointwise source radiation and validate the result by comparison to a high-precision computation, which shows favorable agreement. With four commodity GTX 295 graphics cards in a single personal computer, more than 4000 million Yee cells can be updated per second, which is hundreds of times faster than traditional CPU computation.
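As an illustration of the update pattern such codes parallelize, here is a minimal 1D FDTD leapfrog sketch in NumPy: each CUDA thread in a 3D implementation applies an analogous stencil to one Yee cell. This is a free-space toy with normalized units and arbitrary grid sizes, not the authors' 3D CPML kernels.

```python
import numpy as np

nz, nsteps = 200, 500
ez = np.zeros(nz)        # electric field samples
hy = np.zeros(nz - 1)    # magnetic field, staggered half a cell
courant = 0.5            # 1D stability requires courant <= 1

for t in range(nsteps):
    hy += courant * (ez[1:] - ez[:-1])            # update H from the curl of E
    ez[1:-1] += courant * (hy[1:] - hy[:-1])      # update E from the curl of H
    ez[nz // 2] += np.exp(-((t - 30) / 10) ** 2)  # soft Gaussian source
```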
6

Wang, Wei, Yiyang Hu, Ting Zou, Hongmei Liu, Jin Wang, and Xin Wang. "A New Image Classification Approach via Improved MobileNet Models with Local Receptive Field Expansion in Shallow Layers." Computational Intelligence and Neuroscience 2020 (August 1, 2020): 1–10. http://dx.doi.org/10.1155/2020/8817849.

Abstract:
Because deep neural networks (DNNs) are both memory-intensive and computation-intensive, they are difficult to deploy on embedded systems with limited hardware resources. Therefore, DNN models need to be compressed and accelerated. By applying depthwise separable convolutions, MobileNet can decrease the number of parameters and the computational complexity with little loss of classification precision. Based on MobileNet, three improved MobileNet models with local receptive field expansion in shallow layers, called Dilated-MobileNet (Dilated Convolution MobileNet) models, are proposed, in which dilated convolutions are introduced into a specific convolutional layer of the MobileNet model. Without increasing the number of parameters, dilated convolutions are used to enlarge the receptive field of the convolution filters and obtain better classification accuracy. The experiments were performed on the Caltech-101, Caltech-256, and Tubingen Animals with Attributes datasets. The results show that Dilated-MobileNets can obtain up to 2% higher classification accuracy than MobileNet.
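The core idea fits in a few lines. Below is a sketch of a depthwise separable block with a dilation rate, written in PyTorch; the channel sizes and dilation value are illustrative, not the paper's exact configuration. With dilation=2, the 3x3 depthwise filter covers a 5x5 receptive field using the same nine weights per channel.

```python
import torch
import torch.nn as nn

def dw_separable(in_ch, out_ch, dilation=1):
    # Depthwise separable block; dilation enlarges the receptive field
    # of the 3x3 depthwise stage without adding any parameters.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                  groups=in_ch, bias=False),              # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),          # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

x = torch.randn(1, 32, 56, 56)
y = dw_separable(32, 64, dilation=2)(x)  # same parameter count as dilation=1
```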
7

Xu, Shilin, and Caili Guo. "Computation Offloading in a Cognitive Vehicular Networks with Vehicular Cloud Computing and Remote Cloud Computing." Sensors 20, no. 23 (November 29, 2020): 6820. http://dx.doi.org/10.3390/s20236820.

Abstract:
To satisfy the explosive growth of computation-intensive vehicular applications, we investigate the computation offloading problem in a cognitive vehicular network (CVN). Specifically, in our scheme, vehicular cloud computing (VCC)- and remote cloud computing (RCC)-enabled computation offloading are jointly considered. So far, extensive research has been conducted on RCC-based computation offloading, while studies on VCC-based computation offloading are relatively rare. In fact, due to the dynamics and uncertainty of on-board resources, VCC-based computation offloading is more challenging than the RCC case, especially in vehicular scenarios with expensive inter-vehicle communication or poor communication environments. To solve this problem, we propose to leverage the VCC's computation resources for computation offloading in a perception-exploitation manner, which mainly comprises two stages: resource discovery and computation offloading. In the resource discovery stage, based on the action-observation history, a Long Short-Term Memory (LSTM) model is proposed to predict the on-board resource utilization status at the next time slot. Thereafter, based on the obtained computation resource distribution, a decentralized multi-agent Deep Reinforcement Learning (DRL) algorithm is proposed to solve the collaborative computation offloading with VCC and RCC. Finally, the effectiveness of the proposed algorithms is verified with a host of numerical simulation results from different perspectives.
8

Yin, Lujia, Yiming Zhang, Zhaoning Zhang, Yuxing Peng, and Peng Zhao. "ParaX." Proceedings of the VLDB Endowment 14, no. 6 (February 2021): 864–77. http://dx.doi.org/10.14778/3447689.3447692.

Abstract:
Despite the fact that GPUs and accelerators are more efficient for deep learning (DL), commercial clouds like Facebook and Amazon now heavily use CPUs for DL computation because large numbers of CPUs would otherwise sit idle during off-peak periods. Following this trend, CPU vendors have not only released high-performance many-core CPUs but also developed efficient math kernel libraries. However, current DL platforms cannot scale well to a large number of CPU cores, making many-core CPUs inefficient for DL computation. We analyze the memory access patterns of various layers and identify the root cause of the low scalability: the per-layer barriers that are implicitly imposed by current platforms, which assign one single instance (i.e., one batch of input data) to a CPU. The barriers cause severe memory bandwidth contention and CPU starvation in the access-intensive layers (such as activation and BN). This paper presents a novel approach called ParaX, which boosts the performance of DL on many-core CPUs by effectively alleviating bandwidth contention and CPU starvation. Our key idea is to assign one instance to each CPU core instead of to the entire CPU, so as to remove the per-layer barriers on the executions of the many cores. ParaX designs an ultralight scheduling policy which sufficiently overlaps the access-intensive layers with the compute-intensive ones to avoid contention, and proposes a NUMA-aware gradient server mechanism for training which leverages shared memory to substantially reduce the overhead of per-iteration parameter synchronization. We have implemented ParaX on MXNet. Extensive evaluation on a two-NUMA Intel 8280 CPU shows that ParaX significantly improves the training/inference throughput for all tested models (for image recognition and natural language processing) by 1.73X–2.93X.
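A toy sketch of the "one instance per core" idea follows (not ParaX's MXNet implementation): each worker process is pinned to a single core and runs its own instance end-to-end, so no per-layer barrier spans cores. The compute function is a placeholder, and os.sched_setaffinity is Linux-only.

```python
import multiprocessing as mp
import os

def run_instance(core_id, batch):
    os.sched_setaffinity(0, {core_id})   # pin this worker to one core (Linux)
    return sum(x * x for x in batch)     # stand-in for a forward/backward pass

if __name__ == "__main__":
    batches = [[float(i)] * 1024 for i in range(os.cpu_count())]
    with mp.Pool(processes=os.cpu_count()) as pool:
        # one (core, batch) pair per worker: no cross-core layer barriers
        results = pool.starmap(run_instance, enumerate(batches))
```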
9

Alava, Pallavi, and G. Radhika. "Robust and Secure Framework for Mobile Cloud Computing." Asian Journal of Computer Science and Technology 8, S3 (June 5, 2019): 1–6. http://dx.doi.org/10.51983/ajcst-2019.8.s3.2115.

Abstract:
Smartphone devices are widely used in our daily lives. However, these devices have limitations, such as short battery lifetime, limited computation power, small memory size, and unpredictable network connectivity. Various solutions have therefore been proposed to mitigate these limitations and extend battery lifetime through the use of offloading techniques. In this paper, a novel framework is proposed to offload intensive computation tasks from the mobile device to the cloud. This framework uses an optimization model to determine the offloading decision dynamically based on four main parameters, namely energy consumption, CPU utilization, execution time, and memory usage. Additionally, a new security layer is provided to protect the transferred data in the cloud from any attack. The experimental results show that the framework can choose a suitable offloading decision for different types of mobile application tasks while achieving significant performance improvement. Moreover, unlike previous techniques, the framework can protect application data from any threat.
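The abstract names four decision parameters; the sketch below shows one way such a dynamic decision could be scored. The weights, the lack of per-unit normalization, and the comparison rule are all invented for illustration; the paper formulates this as an optimization model, not this fixed rule.

```python
def should_offload(local, remote, weights=(0.4, 0.2, 0.3, 0.1)):
    """local/remote: dicts with energy (J), cpu (0-1 load), time (s), memory (MB)."""
    keys = ("energy", "cpu", "time", "memory")
    local_cost = sum(w * local[k] for w, k in zip(weights, keys))
    remote_cost = sum(w * remote[k] for w, k in zip(weights, keys))
    return remote_cost < local_cost  # offload when remote execution is cheaper

offload = should_offload(
    local={"energy": 12.0, "cpu": 0.9, "time": 4.0, "memory": 300},
    remote={"energy": 3.0, "cpu": 0.1, "time": 2.5, "memory": 40},
)
```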
10

Piao, Yongri, Zhengkun Rong, Miao Zhang, and Huchuan Lu. "Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11865–73. http://dx.doi.org/10.1609/aaai.v34i07.6860.

Abstract:
Light field saliency detection has attracted increasing interest in recent years due to the significant improvements it brings in challenging scenes through the use of abundant light field cues. However, the high dimensionality of light field data poses computation-intensive and memory-intensive challenges, and light field data access is far less ubiquitous than RGB data. These issues may severely impede practical applications of light field saliency detection. In this paper, we introduce an asymmetrical two-stream architecture inspired by knowledge distillation to confront these challenges. First, we design a teacher network that learns to exploit focal slices for higher requirements on desktop computers and meanwhile transfers comprehensive focusness knowledge to the student network. Our teacher network relies on two tailor-made modules, namely the multi-focusness recruiting module (MFRM) and the multi-focusness screening module (MFSM). Second, we propose two distillation schemes to train a student network towards memory and computation efficiency while maintaining performance. The proposed distillation schemes ensure better absorption of focusness knowledge and enable the student to replace the focal slices with a single RGB image in a user-friendly way. We conduct experiments on three benchmark datasets and demonstrate that our teacher network achieves state-of-the-art performance and the student network (ResNet18) achieves Top-1 accuracy on the HFUT-LFSD dataset and Top-4 on DUT-LFSD, while reducing the model size by 56% and boosting the frames per second (FPS) by 159% compared with the best performing method.
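For readers unfamiliar with distillation, the sketch below shows a generic response-distillation loss of the kind such teacher-student training uses: the student (RGB-only input) matches the teacher's softened predictions (computed from focal slices) alongside the supervised loss. T and alpha are conventional hyperparameters, and the paper's MFRM/MFSM modules and its two tailored schemes are not reproduced here.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.7):
    # Soft term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, target)  # supervised term
    return alpha * soft + (1.0 - alpha) * hard
```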
11

Zhang, Xiaodong, and Lin Sun. "Comparative Evaluation and Case Studies of Shared-Memory and Data-Parallel Execution Patterns." Scientific Programming 7, no. 1 (1999): 1–19. http://dx.doi.org/10.1155/1999/468372.

Abstract:
Shared-memory and data-parallel programming models are two important paradigms for scientific applications. Both models provide high-level program abstractions and simple, uniform views of network structures. The common features of the two models significantly simplify program coding and debugging for scientific applications. However, the underlying execution and overhead patterns differ significantly between the two models due to their programming constraints and due to the different, complex structures of the interconnection networks and systems that support them. We performed this experimental study to present implications and comparisons of execution patterns on two commercial architectures. We implemented a standard electromagnetic simulation program (EM) and a linear system solver using the shared-memory model on the KSR-1 and the data-parallel model on the CM-5. Our objectives are to examine the execution pattern changes required for an implementation transformation between the two models; to study memory access patterns; to address scalability issues; and to investigate the relative costs and advantages/disadvantages of using the two models for scientific computations. Our results indicate that the EM program tends to become computation-intensive in the KSR-1 shared-memory system and memory-demanding in the CM-5 data-parallel system when the systems and the problems are scaled. The EM program, a highly data-parallel program, performed extremely well, while the linear system solver, a highly control-structured program, suffered significantly in the data-parallel model on the CM-5. Our study provides further evidence that matching the execution patterns of algorithms to parallel architectures achieves better performance.
12

Lim, Hyunyul, Tae Hyun Kim, and Sungho Kang. "Prediction-Based Error Correction for GPU Reliability with Low Overhead." Electronics 9, no. 11 (November 5, 2020): 1849. http://dx.doi.org/10.3390/electronics9111849.

Abstract:
Scientific and simulation applications are continuously gaining importance in many fields of research and industry. These applications require massive amounts of memory and substantial arithmetic computation. Therefore, general-purpose computing on graphics processing units (GPGPU), which combines the computing power of graphics processing units (GPUs) and general CPUs, has been used for computationally intensive scientific and big data processing applications. Because current GPU architectures lack hardware support for error detection in computation logic, GPGPU has low reliability. Unlike in graphics applications, errors in GPGPU can lead to serious problems in general-purpose computing applications. These applications are often intertwined with human life, meaning that errors can be life-threatening. Therefore, this paper proposes a novel prediction-based error correction method, called Prediction-based Error Correction (PRECOR), for GPU reliability, which detects and corrects errors in GPGPU platforms with a focus on errors in computational elements. The proposed architecture needs only a small number of checkpoint buffers to fix errors in computational logic. The PRECOR architecture has prediction buffers and controller units for predicting erroneous outputs before performing a rollback; following a rollback, the architecture confirms the accuracy of its predictions. The proposed method effectively reduces the hardware and time overheads required to correct errors. Experimental results confirm that PRECOR efficiently fixes errors with low hardware and time overheads.
13

Tran, Nhat-Phuong, Myungho Lee, and Sugwon Hong. "Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU." Scientific Programming 2017 (2017): 1–16. http://dx.doi.org/10.1155/2017/1205892.

Abstract:
The Lattice Boltzmann Method (LBM) is a powerful numerical simulation method for fluid flow. With its data-parallel nature, it is a promising candidate for parallel implementation on a GPU. The LBM, however, is heavily data-intensive and memory-bound. In particular, moving the data to adjacent cells in the streaming computation phase incurs many uncoalesced accesses on the GPU, which affects the overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread, which limits the thread parallelism available at run time due to the fixed number of registers on the GPU. In this paper, we develop a high-performance parallelization of the LBM on a GPU by minimizing the overheads associated with uncoalesced memory accesses while improving the cache locality using a tiling optimization with a data layout change. Furthermore, we aggressively reduce register use in the LBM kernels in order to increase run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach delivers impressive throughput performance: 1210.63 Million Lattice Updates Per Second (MLUPS).
14

Sadafale, Nishikant, Minal Deshmukh, and Prasad Khandekar. "An Efficient FPGA Overlay for Color Transformation Function Using High Level Synthesis." Information Technology in Industry 9, no. 1 (March 1, 2021): 280–87. http://dx.doi.org/10.17762/itii.v9i1.130.

Abstract:
Image processing is in significant demand in commercial, industrial, and medical applications. Processor-based architectures are inappropriate for real-time applications because image processing algorithms are computationally intensive. To reduce the latency and performance limitations caused by the limited memory and fixed clock frequency of processor-based architectures, FPGAs can be used in smart devices to implement real-time image processing applications. To increase the speed of real-time image processing, custom overlays (hardware libraries of programmable logic circuits) can be designed to run on the FPGA fabric. The IP core generated by HLS (High Level Synthesis) can be implemented on a reconfigurable platform, which allows effective utilization of channel bandwidth and storage. In this paper we present an FPGA overlay design for a color transformation function using Xilinx's Python productivity board PYNQ-Z2 to gain a performance benefit over a traditional processor. A performance comparison of the custom overlay on the FPGA and a processor-based platform shows that FPGA execution yields the minimum computation time.
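As a software reference for the kind of function being overlaid, here is an RGB-to-YCbCr transform (BT.601) in NumPy. The abstract does not name the exact transform implemented, so this particular matrix is an assumed example.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """img: uint8 array of shape (H, W, 3); returns uint8 YCbCr."""
    m = np.array([[ 0.299,  0.587,  0.114],
                  [-0.169, -0.331,  0.500],
                  [ 0.500, -0.419, -0.081]])
    ycbcr = img.astype(np.float32) @ m.T
    ycbcr[..., 1:] += 128.0  # center the chroma channels
    return np.clip(ycbcr, 0, 255).astype(np.uint8)

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
out = rgb_to_ycbcr(frame)  # the per-pixel work an overlay would pipeline
```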
15

Ebdon, Susan Austin, Mary McGee Coakley, and Danielle Legnard. "Mathematical Mind Journeys: Awakening Minds to Computational Fluency." Teaching Children Mathematics 9, no. 8 (April 2003): 486–87. http://dx.doi.org/10.5951/tcm.9.8.0486.

Abstract:
Have you ever wondered what your students are really thinking as they do mathematics? Do you wish that you could stimulate and excite your students while building the fundamental skills necessary for their future? Join us as we take our students on a Mathematical Mind Journey. This learning adventure does not require time-intensive planning or expensive props and materials. Mathematical Mind Journeys are “think aloud” strategies that demystify computation. Students use metacognition to explain the paths that their brains take when solving a problem and rely on mathematical memory rather than memorization. Whether you have a few minutes or a class period, a Mathematical Mind Journey will empower and engage every student in your class.
16

Aharoni, Gad, Amnon Barak, and Amir Ronen. "A competitive algorithm for managing sharing in the distributed execution of functional programs." Journal of Functional Programming 7, no. 4 (July 1997): 421–40. http://dx.doi.org/10.1017/s095679689700275x.

Abstract:
Execution of functional programs on distributed-memory multiprocessors gives rise to the problem of evaluating expressions that are shared between several Processing Elements (PEs). One of the main difficulties of solving this problem is that, for a given shared expression, it is not known in advance whether realizing the sharing is more cost effective than duplicating its evaluation. Realizing the sharing requires coordination between the sharing PEs to ensure that the shared expression is evaluated only once. This coordination involves relatively high communication costs, and is therefore only worthwhile when the shared expressions require much computation time to evaluate. In contrast, when the shared expression is not computation intensive, it is more cost effective to duplicate the evaluation, and thus avoid the communication overhead costs. This dilemma of deciding whether to duplicate the work or to realize the sharing stems from the unknown computation time that is required to evaluate a shared expression. This computation time is difficult to estimate due to unknown run-time evolution of loops and recursion that may be part of the expression. This paper presents an on-line (run-time) algorithm that decides which of the expressions that are shared between several PEs should be evaluated only once, and which expressions should be evaluated locally by each sharing PE. By applying competitive considerations, the algorithm manages to exploit sharing of computation-intensive expressions, while it duplicates the evaluation of expressions that require little time to compute. The algorithm accomplishes this goal even though it has no a priori knowledge of the amount of computation that is required to evaluate the shared expression. We show that this algorithm is competitive with a hypothetical optimal off-line algorithm, which does have such knowledge, and we prove that the algorithm is deadlock free. Furthermore, this algorithm does not require any programmer intervention, it has low overhead, and it is designed to run on a wide variety of distributed systems.
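The decision is structurally similar to the classic ski-rental problem, as the sketch below illustrates: keep duplicating (paying as you go) until the work already spent reaches the one-time communication cost of realizing the sharing, then switch. This toy captures only the 2-competitive structure; the paper's actual algorithm and cost model are more refined.

```python
def evaluate_shared(step_costs, comm_cost):
    """step_costs: iterable of per-step compute costs, length unknown in advance."""
    spent = 0.0
    for cost in step_costs:
        if spent >= comm_cost:
            return spent + comm_cost  # realize sharing: pay once, stop duplicating
        spent += cost                 # keep duplicating the evaluation locally
    return spent                      # expression finished cheaply: duplication won
```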
17

Meng, Wenhui, and Junzhi Cui. "Comparative Study of Two Different FMM–BEM Methods in Solving 2-D Acoustic Transmission Problems with a Multilayered Obstacle." International Journal of Structural Stability and Dynamics 11, no. 01 (February 2011): 197–214. http://dx.doi.org/10.1142/s021945541100404x.

Abstract:
The fast multipole method (FMM) is an effective approach for accelerating the boundary element method (BEM) in solving computationally intensive problems. This paper presents two different BEMs, i.e., Kress' and Seydou's methods, for solving two-dimensional (2D) acoustic transmission problems with a multilayered obstacle, along with the application of the FMM to the solution of the related boundary integral equations. Conventional BEM requires O(MN²) operations to solve the equations for this problem. By using the FMM, both the amount of computation and the memory requirement of the BEM are reduced to O(MN), where M is the number of layers of the obstacle. The efficiency and accuracy of this approach in dealing with acoustic transmission problems containing a multilayered obstacle are demonstrated in the numerical examples. It is confirmed that this approach can be applied to solving acoustic transmission problems for an obstacle with multiple layers.
18

Zhang, Libo, Benqiang Yang, Zhikun Zhuang, Yining Hu, Yang Chen, Limin Luo, and Huazhong Shu. "Optimized Parallelization for Nonlocal Means Based Low Dose CT Image Processing." Computational and Mathematical Methods in Medicine 2015 (2015): 1–11. http://dx.doi.org/10.1155/2015/790313.

Abstract:
Low-dose CT (LDCT) images are often significantly degraded by severely increased mottled noise/artifacts, which can lower diagnostic accuracy in the clinic. Nonlocal means (NLM) filtering can effectively remove mottled noise/artifacts by utilizing the large-scale patch similarity information in LDCT images. However, applying NLM filtering to LDCT imaging incurs a high computation cost, because an intensive patch similarity calculation within a large search window is often required to include enough structure-similarity information for noise/artifact suppression. To improve its clinical feasibility, in this study we further optimize the parallelization of NLM filtering by avoiding repeated computation through row-wise intensity calculation and symmetric weight calculation. Shared memory with fast I/O speed is also used in the row-wise intensity calculation of the proposed method. Quantitative experiments demonstrate that significant acceleration can be achieved with respect to the traditional straight pixel-wise parallelization.
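The symmetry being exploited is w(i, j) = w(j, i): each patch distance can be computed once and written to both entries, halving the weight computation. A toy 1D version follows (not the authors' optimized kernels):

```python
import numpy as np

def nlm_weights(signal, patch=3, h=0.5):
    half = patch // 2
    padded = np.pad(signal, half, mode="reflect")
    patches = np.stack([padded[i:i + patch] for i in range(len(signal))])
    n = len(signal)
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):  # upper triangle only
            d = np.sum((patches[i] - patches[j]) ** 2)
            w[i, j] = w[j, i] = np.exp(-d / (h * h))  # symmetric reuse
    return w / w.sum(axis=1, keepdims=True)
```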
19

Sheng, Jinfang, Jie Hu, Xiaoyu Teng, Bin Wang, and Xiaoxia Pan. "Computation Offloading Strategy in Mobile Edge Computing." Information 10, no. 6 (June 2, 2019): 191. http://dx.doi.org/10.3390/info10060191.

Abstract:
Mobile phone applications have been growing rapidly, together with Internet of Things (IoT) applications in augmented reality, virtual reality, and ultra-clear video, due to the development of mobile Internet services over the last three decades. These applications demand intensive computing to support data analysis, real-time video processing, and decision-making, in order to optimize the user experience. Mobile smart devices play a significant role in our daily lives, and this upward trend is continuing. Nevertheless, these devices suffer from limited resources such as CPU, memory, and energy. Computation offloading is a promising technique that can extend the lifetime and improve the performance of smart devices by offloading local computation tasks to edge servers. In light of this situation, the strategy of computation offloading has been adopted to solve this problem. In this paper, we propose a computation offloading strategy for a scenario with multiple users and multiple mobile edge servers that considers the performance of intelligent devices and server resources. The strategy contains three main stages. In the offloading decision-making stage, the basis for the offloading decision is established by considering the computing task size, the computing requirement, the computing capacity of the server, and the network bandwidth. In the server selection stage, the candidate servers are evaluated comprehensively by multi-objective decision-making, and appropriate servers are selected for computation offloading. In the task scheduling stage, a task scheduling model based on an improved auction algorithm is proposed that considers the time requirements of the computing tasks and the computing performance of the mobile edge computing server. Extensive simulations demonstrate that the proposed computation offloading strategy can effectively reduce service delay and the energy consumption of intelligent devices, and improve the user experience.
20

Rojek, Krzysztof, Kamil Halbiniak, and Lukasz Kuczynski. "CFD code adaptation to the FPGA architecture." International Journal of High Performance Computing Applications 35, no. 1 (November 10, 2020): 33–46. http://dx.doi.org/10.1177/1094342020972461.

Abstract:
In recent years, we have observed the intensive development of accelerated computing platforms. Although current trends indicate a well-established position for GPU devices in the HPC environment, the FPGA (Field-Programmable Gate Array) aspires to be an alternative solution for offloading CPU computation. This paper presents a systematic adaptation of four various CFD (Computational Fluid Dynamics) kernels to the Xilinx Alveo U250 FPGA. The goal of this paper is to investigate the potential of the FPGA architecture as a future infrastructure able to provide the most complex numerical simulations in the area of fluid flow modeling. The selected kernels are customized to a real scientific scenario, compatible with the EULAG (Eulerian/semi-Lagrangian) fluid solver. The solver is used to simulate thermo-fluid flows across a wide range of scales and is extensively used in numerical weather prediction. The proposed adaptation is focused on an analysis of the strengths and weaknesses of the FPGA accelerator, considering performance and energy efficiency. It is compared with a CPU implementation that was strongly optimized to provide realistic and objective benchmarks. The performance results are compared with a set of server CPUs spanning several Intel generations, including Intel Skylake-based CPUs such as the Xeon Gold 6148 and Xeon Platinum 8168, as well as the Intel Xeon E5-2695 CPU based on the Ivy Bridge architecture. Since all the kernels belong to the group of memory-bound algorithms, our main challenge is to saturate global memory bandwidth and provide data locality with intensive BRAM (Block RAM) reuse. Our adaptation improves performance per watt by up to 80% compared to the CPUs.
21

Somula, Ramasubbareddy, and Sasikala R. "A Load and Distance Aware Cloudlet Selection Strategy in Multi-Cloudlet Environment." International Journal of Grid and High Performance Computing 11, no. 2 (April 2019): 85–102. http://dx.doi.org/10.4018/ijghpc.2019040105.

Abstract:
The use of mobile devices (MDs) in people's daily lives is growing steadily. However, MDs remain limited in terms of memory, battery lifetime, and processing capacity. To overcome these issues, the emerging technology of mobile cloud computing (MCC) has been introduced. The offloading mechanism executes resource-intensive applications on a remote cloud to save both battery and execution time. However, the high-latency challenges in MCC still need to be addressed by executing resource-intensive tasks on a nearby resource-rich cloud server. The key challenge is to find the optimal cloudlet on which to execute a task so as to save computation time. In this article, the authors propose a Round Robin algorithm for cloudlet selection in a heterogeneous MCC system. The article considers both the load and the distance of a server to find the optimal cloudlet and to minimize the waiting time of user requests in the server queue. Additionally, the authors provide a mathematical evaluation of the algorithm and compare it with existing load balancing algorithms.
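A sketch of a load- and distance-aware selection rule is shown below; the linear combination and its weights are assumptions for illustration, and the paper couples selection with a Round Robin policy and a queueing analysis not shown here.

```python
def select_cloudlet(cloudlets, w_load=0.6, w_dist=0.4):
    """cloudlets: dicts with 'load' (queued tasks) and 'dist' (e.g., RTT in ms)."""
    return min(cloudlets, key=lambda c: w_load * c["load"] + w_dist * c["dist"])

best = select_cloudlet([
    {"name": "c1", "load": 12, "dist": 5.0},
    {"name": "c2", "load": 3,  "dist": 9.0},
    {"name": "c3", "load": 7,  "dist": 2.0},
])  # lowest combined load/distance cost wins
```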
22

Craus, Mitică, and Silviu-Dumitru Pavăl. "An Accelerating Numerical Computation of the Diffusion Term in a Nonlocal Reaction-Diffusion Equation." Mathematics 8, no. 12 (November 26, 2020): 2111. http://dx.doi.org/10.3390/math8122111.

Abstract:
In this paper we propose and compare two methods to optimize the numerical computation of the diffusion term in a nonlocal formulation of a reaction-diffusion equation. The diffusion term is particularly computationally intensive due to its integral formulation, and thus finding a better way of computing its numerical approximation is of interest, given that the numerical analysis usually takes place on large input domains of more than one dimension. After introducing the general reaction-diffusion model, we discuss a numerical approximation scheme for the diffusion term based on a finite difference method. In the following sections we propose two algorithms to solve the numerical approximation scheme, focusing on improving the time performance. While the first algorithm (sequential) is used as a baseline for performance measurement, the second algorithm (parallel) is implemented using two different memory-sharing parallelization technologies: Open Multi-Processing (OpenMP) and CUDA. All the results were obtained by using the model in image processing applications such as image restoration and segmentation.
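To make the cost concrete: the nonlocal diffusion term has the form L[u](x) = ∫ J(x−y)(u(y) − u(x)) dy, so every grid point integrates over a neighborhood. A 1D NumPy quadrature sketch follows; the Gaussian kernel, grid, and boundary handling are assumptions, not the paper's scheme.

```python
import numpy as np

def nonlocal_diffusion(u, dx, radius=10, sigma=0.1):
    offsets = np.arange(-radius, radius + 1) * dx
    kernel = np.exp(-offsets**2 / (2 * sigma**2))
    kernel /= kernel.sum() * dx  # normalize so the kernel integrates to 1
    padded = np.pad(u, radius, mode="edge")
    conv = np.convolve(padded, kernel * dx, mode="valid")  # (J * u)(x)
    return conv - u  # (J * u)(x) - u(x), since J integrates to 1

x = np.linspace(0.0, 1.0, 256)
lap = nonlocal_diffusion(np.sin(2 * np.pi * x), dx=x[1] - x[0])
```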
23

Ivanović, Miloš, Ana Kaplarević-Mališić, Boban Stojanović, Marina Svičević, and Srboljub M. Mijailovich. "Machine learned domain decomposition scheme applied to parallel multi-scale muscle simulation." International Journal of High Performance Computing Applications 33, no. 5 (March 12, 2019): 885–96. http://dx.doi.org/10.1177/1094342019833151.

Abstract:
Since multi-scale models of muscles rely on the integration of physical and biochemical properties across multiple length and time scales, they are highly processor and memory intensive. Consequently, their practical implementation and usage in real-world applications is limited by high computational requirements. There are various reported solutions to the problem of parallel computation of various multi-scale models, but due to their inherent complexity, load balancing remains a challenging task. In this article, we present a novel load balancing method for multi-scale simulations based on finite element (FE) method. The method employs a computationally simple single-scale model and machine learning in order to predict computational weights of the integration points within a complex multi-scale model. Employing the obtained weights, it is possible to improve the domain decomposition prior to the complex multi-scale simulation run and consequently reduce computation time. The method is applied to a two-scale muscle model, where the FE on macroscale is coupled with Huxley’s model of cross-bridge kinetics on the microscale. Our massive parallel solution is based on the static domain decomposition policy and operates in a heterogeneous (central processing units + graphics processing units) environment. The approach has been verified on a real-world example of the human tongue, showing high utilization of all processors and ensuring high scalability, owing to the proposed load balancing scheme. The performance analysis shows that the inclusion of the prediction of the computational weights reduces execution time by about 40% compared to the run which uses a trivial load balancer which assumes identical computational weights of all micro-models. The proposed domain decomposition approach possesses a high capability to be applied in a variety of multi-scale models based on the FE method.
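The two-step idea can be sketched compactly: fit a regressor that maps cheap single-scale features to per-element cost, then hand the predicted weights to the partitioner. The features, the regressor, and the greedy balancer below are placeholders; in particular, a real FE partitioner must also respect mesh connectivity, which this bin-packing ignores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = rng.random((1000, 3))   # e.g., activation level, strain, history terms
true_cost = features @ np.array([5.0, 2.0, 0.5]) + 1.0   # synthetic training data
model = LinearRegression().fit(features, true_cost)

weights = model.predict(features)  # predicted cost per integration point
loads = np.zeros(8)                # 8 partitions/processors
part = np.empty(len(weights), dtype=int)
for idx in np.argsort(-weights):   # greedy: heaviest elements first
    p = int(np.argmin(loads))      # assign to the least-loaded partition
    loads[p] += weights[idx]
    part[idx] = p
```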
24

Lee, Byoung-Dai, Kwang-Ho Lim, and Namgi Kim. "Development of Energy-aware Mobile Applications Based on Resource Outsourcing." International Journal of Software Engineering and Knowledge Engineering 24, no. 08 (October 2014): 1225–43. http://dx.doi.org/10.1142/s0218194014500399.

Abstract:
Smart connected devices such as smartphones and tablets are battery-operated to facilitate their mobility. Therefore, low power consumption is a critical requirement for mobile hardware and for the software designed for such devices. In addition to efficient power management techniques and new battery technologies based on nanomaterials, cloud computing has emerged as a promising technique for reducing energy consumption as well as augmenting the computational and memory capabilities of mobile devices. In this study, we designed and implemented a framework that allows for the energy-efficient execution of mobile applications by partially offloading the workload of a mobile device onto a resourceful cloud. This framework comprises a development toolkit, which facilitates the development of mobile applications capable of supporting computation offloading, and a runtime infrastructure for deployment in the cloud. Using this framework, we implemented three different mobile applications and demonstrated that considerable energy savings can be achieved compared with local processing for both resource-intensive and lightweight applications, especially when using high-speed networks such as Wi-Fi and Long-Term Evolution.
25

Su, Huayou, Mei Wen, Nan Wu, Ju Ren, and Chunyuan Zhang. "Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation." Scientific World Journal 2014 (2014): 1–19. http://dx.doi.org/10.1155/2014/716020.

Abstract:
By reorganizing the execution order and optimizing the data structure, we propose an efficient parallel framework for the H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework in CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose several optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation achieves a 20X speedup over the serial program and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR ranges from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computational power of the GPU. However, the performance of the control-intensive parts (CAVLC) is closely related to the memory bandwidth, which provides an insight for new architecture design.
26

Fu, Kui, Peipei Shi, Yafei Song, Shiming Ge, Xiangju Lu, and Jia Li. "Ultrafast Video Attention Prediction with Coupled Knowledge Distillation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 10802–9. http://dx.doi.org/10.1609/aaai.v34i07.6710.

Abstract:
Large convolutional neural network models have recently demonstrated impressive performance on video attention prediction. Conventionally, these models require intensive computation and large memory. To address these issues, we design an extremely lightweight network with ultrafast speed, named UVA-Net. The network is constructed based on depthwise convolutions and takes low-resolution images as input. However, this straightforward acceleration method decreases performance dramatically. To this end, we propose a coupled knowledge distillation strategy to augment and train the network effectively. With this strategy, the model can automatically discover and emphasize implicit useful cues contained in the data. Both the spatial and the temporal knowledge learned by the high-resolution complex teacher networks can also be distilled and transferred into the proposed low-resolution lightweight spatiotemporal network. Experimental results show that the performance of our model is comparable to that of 11 state-of-the-art models in video attention prediction, while it costs only a 0.68 MB memory footprint and runs at about 10,106 FPS on a GPU and 404 FPS on a CPU, which is 206 times faster than previous models.
27

Han, Zhe, Jingfei Jiang, Linbo Qiao, Yong Dou, Jinwei Xu, and Zhigang Kan. "Accelerating Event Detection with DGCNN and FPGAs." Electronics 9, no. 10 (October 13, 2020): 1666. http://dx.doi.org/10.3390/electronics9101666.

Abstract:
Recently, Deep Neural Networks (DNNs) have been widely used in natural language processing. However, DNNs are often computation-intensive and memory-expensive, and therefore deploying DNNs in the real world is very difficult. In order to solve this problem, we propose a network model based on the dilated gated convolutional neural network, which is very hardware-friendly. We further expand the word representations and the depth of the network to improve the model's performance. We replace the sigmoid function with a more hardware-friendly alternative at no loss in accuracy, and we quantize the network weights and activations to compress the network size. We then propose the first FPGA (Field Programmable Gate Array)-based event detection accelerator based on the proposed model. The accelerator significantly reduces latency with its fully pipelined architecture. We implemented the accelerator on the Xilinx XCKU115 FPGA. The experimental results show that our model obtains the highest F1-score of 84.6% on the ACE 2005 corpus, while the accelerator achieves 95.2 giga operations per second (GOPS) and 13.4 GOPS/W in performance and energy efficiency, which are 17 and 158 times higher, respectively, than a Graphics Processing Unit (GPU) implementation.
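The abstract does not specify the replacement function or the bit widths, so the sketch below uses common hardware-friendly stand-ins: a piecewise-linear "hard" sigmoid (shifts and adds only, no exp()) and symmetric int8 weight quantization.

```python
import numpy as np

def hard_sigmoid(x):
    # max(0, min(1, 0.25*x + 0.5)): a piecewise-linear sigmoid substitute
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0  # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                  # recover approximate weights as q * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)              # 4x smaller than float32 storage
```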
28

Sinharoy, Balaram. "Compiler Optimization to Improve Data Locality for Processor Multithreading." Scientific Programming 7, no. 1 (1999): 21–37. http://dx.doi.org/10.1155/1999/235625.

Abstract:
Over the last decade processor speed has increased dramatically, whereas the speed of the memory subsystem has improved at a modest rate. Due to the increase in cache miss latency (in terms of processor cycles), processors stall on cache misses for a significant portion of their execution time. Multithreaded processors have been proposed in the literature to reduce the processor stall time due to cache misses. Although multithreading improves processor utilization, it may also increase cache miss rates, because in a multithreaded processor multiple threads share the same cache, which effectively reduces the cache size available to each individual thread. Increased processor utilization together with an increased cache miss rate demands higher memory bandwidth. A novel compiler optimization method is presented in this paper that improves data locality for each of the threads and enhances data sharing among the threads. The method is based on loop transformation theory and optimizes both spatial and temporal data locality. The created threads exhibit a high level of intra-thread and inter-thread data locality, which effectively reduces both the data cache miss rates and the total execution time of numerically intensive computations running on a multithreaded processor.
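Loop tiling is the canonical transformation behind such locality optimizations, as the sketch below shows for a blocked matrix multiply: the loops are blocked so that a tile of each operand stays cache-resident across iterations. This demonstrates the general technique only, not the paper's compiler algorithm, which also restructures loops to share data between threads.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    n = A.shape[0]
    C = np.zeros((n, n))
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # block update keeps the A and B tiles hot in cache
                C[ii:ii+tile, jj:jj+tile] += (
                    A[ii:ii+tile, kk:kk+tile] @ B[kk:kk+tile, jj:jj+tile]
                )
    return C
```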
29

Yahuza, Muktar, Yamani Idna Bin Idris, Ainuddin Wahid Bin Abdul Wahab, Mahdi A. Musa, and Adamu Abdullahi Garba. "A Lightweight Authentication Technique for Secure Communication of Edge/Fog Data-Centers." Science Proceedings Series 2, no. 1 (April 25, 2020): 76–81. http://dx.doi.org/10.31580/sps.v2i1.1319.

Abstract:
Edge computing has significantly enhanced the capabilities of cloud computing. Edge data-centres are used for storing the data of end-user devices, and secure communication between legitimate edge data-centres during the load balancing process has attracted industrial and academic researchers. Recently, Puthal et al. proposed a technique for authenticating edge data-centres to enable secure load balancing. However, the resource-constrained nature of edge data-centres is ignored: the scheme is characterized by complex computation and a memory-intensive cryptographic protocol. It is also vulnerable to key escrow attack, because the secret key used for encrypting and decrypting the communicated messages is created by the trusted cloud datacenter. Additionally, the key sharing phase of their algorithm is complex. Therefore, to address the highlighted challenges, this paper proposes a lightweight, key-escrow-less authentication algorithm that ensures secure communication between resource-constrained edge data-centres during the load balancing process. The security of the proposed scheme has been formally evaluated using the automatic cryptographic analysis tool ProVerif. The relatively low computation and communication costs of the proposed scheme compared to the benchmark schemes prove that it is lightweight and thus suitable for resource-constrained edge data-centres.
30

Chen, Chuanglu, Zhiqiang Li, Yitao Zhang, Shaolong Zhang, Jiena Hou, and Haiying Zhang. "Low-Power FPGA Implementation of Convolution Neural Network Accelerator for Pulse Waveform Classification." Algorithms 13, no. 9 (August 31, 2020): 213. http://dx.doi.org/10.3390/a13090213.

Abstract:
In pulse waveform classification, the convolutional neural network (CNN) shows excellent performance. However, due to its numerous parameters and intensive computation, it is challenging to deploy a CNN model on low-power devices. To solve this problem, we implement a CNN accelerator based on a field-programmable gate array (FPGA), which can accurately and quickly infer the waveform category. By designing the structure of the CNN, we significantly reduce its parameters while preserving high accuracy. The CNN is then realized on the FPGA and optimized with a variety of memory access optimization methods. Experimental results show that our customized CNN has high accuracy and fewer parameters, and that the accelerator consumes only 0.714 W at a working frequency of 100 MHz, which proves that our proposed solution is feasible. Furthermore, the accelerator classifies the pulse waveform in real time, which could help doctors make a diagnosis.
31

Mahmood, Faisal, Märt Toots, Lars-Göran Öfverstedt, and Ulf Skoglund. "Algorithm and Architecture Optimization for 2D Discrete Fourier Transforms with Simultaneous Edge Artifact Removal." International Journal of Reconfigurable Computing 2018 (August 6, 2018): 1–17. http://dx.doi.org/10.1155/2018/1403181.

Abstract:
Two-dimensional discrete Fourier transform (DFT) is an extensively used and computationally intensive algorithm, with a plethora of applications. 2D images are, in general, nonperiodic but are assumed to be periodic while calculating their DFTs. This leads to cross-shaped artifacts in the frequency domain due to spectral leakage. These artifacts can have critical consequences if the DFTs are being used for further processing, specifically for biomedical applications. In this paper we present a novel FPGA-based solution to calculate 2D DFTs with simultaneous edge artifact removal for high-performance applications. Standard approaches for removing these artifacts, using apodization functions or mirroring, either involve removing critical frequencies or necessitate a surge in computation by significantly increasing the image size. We use a periodic plus smooth decomposition-based approach that was optimized to reduce DRAM access and to decrease 1D FFT invocations. 2D FFTs on FPGAs also suffer from the so-called “intermediate storage” or “memory wall” problem, which is due to limited on-chip memory, increasingly large image sizes, and strided column-wise external memory access. We propose a “tile-hopping” memory mapping scheme that significantly improves the bandwidth of the external memory for column-wise reads and can reduce the energy consumption up to 53%. We tested our proposed optimizations on a PXIe-based Xilinx Kintex 7 FPGA system communicating with a host PC, which gives us the advantage of further expanding the design for biomedical applications such as electron microscopy and tomography. We demonstrate that our proposed optimizations can lead to 2.8× reduced FPGA and DRAM energy consumption when calculating high-throughput 4096×4096 2D FFTs with simultaneous edge artifact removal. We also used our high-performance 2D FFT implementation to accelerate filtered back-projection for reconstructing tomographic data.
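The periodic-plus-smooth decomposition the design builds on (Moisan's method) fits in a short NumPy reference: a "smooth" image computed from the boundary discrepancies is subtracted, so the remainder is effectively periodic and its DFT is free of the cross-shaped artifact. This shows only the math; the paper's contribution is the optimized FPGA/DRAM implementation.

```python
import numpy as np

def periodic_smooth(u):
    h, w = u.shape
    v = np.zeros_like(u, dtype=np.float64)   # boundary discrepancy image
    v[0, :] += u[-1, :] - u[0, :]
    v[-1, :] += u[0, :] - u[-1, :]
    v[:, 0] += u[:, -1] - u[:, 0]
    v[:, -1] += u[:, 0] - u[:, -1]
    q = np.cos(2 * np.pi * np.arange(h) / h)[:, None]
    r = np.cos(2 * np.pi * np.arange(w) / w)[None, :]
    denom = 2 * q + 2 * r - 4
    denom[0, 0] = 1.0                        # avoid 0/0 at the DC term
    S = np.fft.fft2(v) / denom
    S[0, 0] = 0.0                            # smooth part carries no DC offset
    s = np.real(np.fft.ifft2(S))
    return u - s, s                          # periodic part, smooth part

p, s = periodic_smooth(np.random.rand(256, 256))
spectrum = np.fft.fft2(p)                    # no cross-shaped leakage artifact
```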
32

de la Kethulle de Ryhove, Sébastien, and Rune Mittet. "3D marine magnetotelluric modeling and inversion with the finite-difference time-domain method." Geophysics 79, no. 6 (November 1, 2014): E269–E286. http://dx.doi.org/10.1190/geo2014-0110.1.

Abstract:
Frequency-domain methods, which are typically applied to 3D magnetotelluric (MT) modeling, require solving a system of linear equations for every frequency of interest, which is both memory-intensive and computationally intensive. We developed a finite-difference time-domain algorithm to perform 3D MT modeling in a marine environment, in which Maxwell's equations are solved in a so-called fictitious-wave domain. Boundary conditions are treated efficiently via convolutional perfectly matched layers, for which we evaluated optimized parameter values obtained by testing over a large number of models. In comparison to the typically applied frequency-domain methods, two advantages of the finite-difference time-domain method are (1) that it is an explicit, low-memory method that entirely avoids the solution of systems of linear equations and (2) that it allows the computation of the electromagnetic field unknowns at all frequencies of interest in a single simulation. We derive a design criterion for vertical node spacing in a nonuniform grid using dispersion analysis as a starting point. Modeling results obtained using our finite-difference time-domain algorithm are compared with results obtained using an integral equation method, and the agreement was found to be very good. We also discuss a real data inversion example in which MT modeling was done with our algorithm.
33

Li, Jinfeng. "Computing Benchmark of Gadolinium-bearing Fuel Pins’ Depletion Skin Effect based on Deterministic and Monte Carlo Methods." Annals of Emerging Technologies in Computing 5, no. 1 (January 1, 2021): 1–12. http://dx.doi.org/10.33166/aetic.2021.01.001.

Abstract:
Nuclear reactor core depletion and thermal-hydraulics coupling have long been calculation-intensive tasks, challenging both nuclear industry development and academic research projects with respect to memory and time budgets. Although future evolution in smart computation hardware, with artificial intelligence and quantum computing facilities embedded, could continuously push the limits of predictive modelling, the fundamental reactor physics model will still tip the balance in underpinning prediction accuracy, as evidenced by the benchmark in this work of two computational models for characterising the depletion of highly self-shielded Gadolinium burnable-poison-bearing fuel pins in assessing the start-up core performance of the first British European Pressurised Reactor. Specifically, a sub-group multi-annular-ring method is verified to efficiently represent the self-shielded skin effect, which addresses the deficiencies of classic equivalence models. The subgroup method is subsequently applied in a deterministic neutron transport code and a Monte Carlo stochastic code, respectively, for a further benchmark. The resulting discrepancies in power peaking factors for the same assembly are less than 2% for the first fuel cycle, an agreement which demonstrates the validity of the proposed subgroup model. At the forefront of efforts to quantitatively understand the behaviour of burnable poisons precisely for fuel optimisation (e.g., mitigating power peaking), this work could also be used advantageously for training purposes, boosting safety philosophy and public engagement in the roadmap for decarbonisation.
APA, Harvard, Vancouver, ISO, and other styles
34

Inupakutika, D., D. Akopian, P. Chalela, and A. G. Ramirez. "Performance analysis of Mobile Cloud Computing Architectures for mHealth app." Electronic Imaging 2020, no. 3 (January 26, 2020): 335–1. http://dx.doi.org/10.2352/issn.2470-1173.2020.3.mobmu-332.

Full text
Abstract:
With the proliferation and increasing use of smartphones, Mobile Health (mHealth) applications (apps) are widely used to monitor the health of patients with chronic medical conditions. Mobile devices have limited computation power and energy supply, which may lead to delayed alarms, shorter battery life, or excessive memory usage, limiting their ability to execute resource-intensive functionality and inhibiting proper medical monitoring. These limitations can be overcome by the integration of mobile and cloud computing (Mobile Cloud Computing (MCC)), which expands mobile devices' capabilities. With the advent of different MCC architectures, such as mobile user-side tools or network-side architectures, it is important to choose a suitable architecture for mHealth apps. We survey MCC architectures and present a comparative analysis of their performance against a resource-demanding representative testing scenario in a prototype mHealth app. This work numerically compares the mobile cloud architectures for a case-study mHealth app for Endocrine Hormonal Therapy (EHT) adherence. Experimental results are reported and conclusions are drawn concerning the design of the prototype mHealth app system using the MCC architectures.
APA, Harvard, Vancouver, ISO, and other styles
35

Shehzad, Muhammad Faisal, Mainak Dan, Valerio Mariani, Seshadhri Srinivasan, Davide Liuzza, Carmine Mongiello, Roberto Saraceno, and Luigi Glielmo. "A Heuristic Algorithm for Combined Heat and Power System Operation Management." Energies 14, no. 6 (March 12, 2021): 1588. http://dx.doi.org/10.3390/en14061588.

Full text
Abstract:
This paper presents a computationally efficient, novel heuristic approach for solving the combined heat and power economic dispatch (CHP-ED) problem in residential buildings, considering component interconnections. The proposed solution is meant as a substitute for cutting-edge approaches such as model predictive control, where the problem is a mixed-integer nonlinear program (MINLP) known to be computationally intensive and therefore to require specialized hardware and sophisticated solvers not suited for residential use. The proposed heuristic algorithm targets simple embedded hardware with limited computation and memory and, taking as inputs the hourly thermal and electrical demand estimated from daily load profiles, computes a dispatch of the energy vectors including the CHP. The main idea of the heuristic is to first decompose the three energy vector requests: electrical, thermal, and hot water. These are then combined and dispatched considering interconnection and operational constraints. The proposed algorithm is illustrated using a series of simulations on a residential pilot with a nano-cogenerator unit and shows around 25–30% energy savings when compared with a meta-heuristic genetic algorithm approach.
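A minimal sketch of the decompose-then-combine idea for a single hour might look as follows; all device names, capacities, and the heat-led CHP rule are assumptions for illustration, not the paper's algorithm:

```python
def dispatch_hour(elec_kw, heat_kw, chp_max_th=8.0, chp_e_to_th=0.6,
                  boiler_max=12.0):
    """Greedy single-hour dispatch: run the CHP against the thermal
    request, credit its electrical by-product, then fill the residuals
    from the boiler and the grid. All parameters are illustrative."""
    chp_th = min(heat_kw, chp_max_th)        # CHP follows the heat load
    chp_el = chp_th * chp_e_to_th            # coupled electrical output
    boiler = min(max(heat_kw - chp_th, 0.0), boiler_max)
    grid   = max(elec_kw - chp_el, 0.0)      # residual electricity
    export = max(chp_el - elec_kw, 0.0)      # surplus CHP electricity
    return {"chp_th": chp_th, "chp_el": chp_el,
            "boiler": boiler, "grid": grid, "export": export}

# 24-hour toy profiles (kW): dispatch each hour independently
elec = [1.5]*7 + [3.0]*3 + [2.0]*7 + [4.5]*5 + [2.0]*2
heat = [2.0]*6 + [6.0]*4 + [3.0]*8 + [7.0]*4 + [3.0]*2
plan = [dispatch_hour(e, h) for e, h in zip(elec, heat)]
print(sum(p["grid"] for p in plan), "kWh imported from the grid")
```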
APA, Harvard, Vancouver, ISO, and other styles
36

Awadalla, Medhat, and Ahmed M. Sadek. "An Efficient Cache Organization for On-Chip Multiprocessor Networks." International Journal of Electrical and Computer Engineering (IJECE) 5, no. 3 (June 1, 2015): 503. http://dx.doi.org/10.11591/ijece.v5i3.pp503-517.

Full text
Abstract:
To meet the demands of computation-intensive applications and the need for low-power, high-performance systems, the number of computing resources on a single chip has increased enormously. As many computing resources are added to build a System-on-Chip, the interconnection between them becomes another challenging issue. In most System-on-Chip applications, a shared-bus interconnection, which needs arbitration logic to serialize several bus access requests, is adopted to communicate with each integrated processing unit because of its low cost and simple control characteristics. This paper focuses on the interconnection design issues of area, power, and performance in chip multiprocessors with shared cache memory. It shows that shared cache memory contributes to performance improvement; however, the typical crossbar interconnection between cores and the shared cache occupies most of the chip area, consumes a lot of power, and does not scale efficiently with an increasing number of cores. New interconnection mechanisms are needed to address these issues. This paper proposes an architectural paradigm that attempts to gain the advantages of a shared cache while avoiding the penalty imposed by the crossbar interconnect. The proposed architecture achieves smaller area occupation, allowing more space for additional cache memory, and reduces power consumption compared to the existing crossbar architecture. Furthermore, the paper presents a modified cache coherence algorithm called Tuned-MESI. It is based on the typical MESI cache coherence algorithm but is tuned and tailored for the suggested architecture. The results of the conducted simulation experiments show that the developed architecture produces fewer broadcast operations compared to the typical algorithm.
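For reference, the classic MESI protocol that Tuned-MESI builds on can be written as a small transition table; the paper's tuning details are not reproduced here, so this is a baseline sketch only:

```python
# Minimal sketch of the classic MESI states for one cache line:
# M(odified), E(xclusive), S(hared), I(nvalid).
MESI = {
    # (state, event) -> next state
    ("I", "local_read_miss_shared"):  "S",  # another cache supplies data
    ("I", "local_read_miss_private"): "E",  # memory supplies, no sharers
    ("I", "local_write"):             "M",  # read-for-ownership
    ("E", "local_write"):             "M",  # silent upgrade, no bus traffic
    ("E", "remote_read"):             "S",
    ("S", "local_write"):             "M",  # invalidate other sharers
    ("S", "remote_write"):            "I",
    ("M", "remote_read"):             "S",  # write back, then share
    ("M", "remote_write"):            "I",  # write back, then invalidate
}

def next_state(state, event):
    return MESI.get((state, event), state)  # unlisted events keep the state

assert next_state("E", "local_write") == "M"   # no broadcast needed
assert next_state("S", "remote_write") == "I"
```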
APA, Harvard, Vancouver, ISO, and other styles
37

Kuo, F. A., and J. S. Wu. "Implementation of a parallel high-order WENO-type Euler equation solver using a CUDA PTX paradigm." Journal of Mechanics 37 (2021): 496–512. http://dx.doi.org/10.1093/jom/ufab016.

Full text
Abstract:
This study proposes the optimization of low-level assembly code to reconstruct the flux for a splitting flux Harten–Lax–van Leer (SHLL) scheme on high-end graphics processing units. The proposed solver is implemented using the weighted essentially non-oscillatory (WENO) reconstruction method to simulate compressible gas flows governed by the unsteady Euler equations. Instructions in the low-level assembly code, i.e., the parallel thread execution (PTX) instruction set architecture in the compute unified device architecture (CUDA), are used to optimize the CUDA kernel for the flux reconstruction method. The reconstruction method is fifth-order accurate and processes the high-resolution intercell flux to achieve a highly localized scheme, such as the high-order implementation of the SHLL scheme. Many benchmark test cases, including shock-tube and four-shock problems, are demonstrated and compared. The results show that the reconstruction method is computationally very intensive and, using the low-level CUDA assembly code, can achieve excellent performance of up to 5183 GFLOP/s, about 66% of the peak performance of an NVIDIA V100. The computational efficiency is twice that reported in previous studies. The CUDA assembly code reduces computation by 26.7% and increases bandwidth utilization by 37.5%; the optimized kernel reaches up to 990 GB/s of bandwidth. The overall efficiency of bandwidth and computational performance reaches 127% of the performance predicted by the HBM2-memory roofline model estimated with the Empirical Roofline Tool.
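The arithmetic that such PTX kernels optimize is the standard fifth-order WENO-JS reconstruction; a plain NumPy sketch of one left-biased interface value (no CUDA-level optimization implied) is:

```python
import numpy as np

def weno5_left(f, eps=1e-6):
    """Fifth-order WENO-JS reconstruction of the left-biased interface
    value f_{i+1/2} from the five-point stencil f[i-2..i+2]."""
    fm2, fm1, f0, fp1, fp2 = f
    # smoothness indicators of the three candidate stencils
    b0 = 13/12*(fm2 - 2*fm1 + f0)**2 + 1/4*(fm2 - 4*fm1 + 3*f0)**2
    b1 = 13/12*(fm1 - 2*f0 + fp1)**2 + 1/4*(fm1 - fp1)**2
    b2 = 13/12*(f0 - 2*fp1 + fp2)**2 + 1/4*(3*f0 - 4*fp1 + fp2)**2
    # nonlinear weights biased away from non-smooth stencils
    a = np.array([0.1, 0.6, 0.3]) / (eps + np.array([b0, b1, b2]))**2
    w = a / a.sum()
    # third-order candidate reconstructions
    q0 = (2*fm2 - 7*fm1 + 11*f0) / 6
    q1 = ( -fm1 + 5*f0  +  2*fp1) / 6
    q2 = (2*f0  + 5*fp1 -   fp2) / 6
    return w @ np.array([q0, q1, q2])

print(weno5_left(np.array([1., 1., 1., 0., 0.])))  # weights avoid the jump
```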
APA, Harvard, Vancouver, ISO, and other styles
38

Yang, Pengliang, Romain Brossier, and Jean Virieux. "Wavefield reconstruction by interpolating significantly decimated boundaries." GEOPHYSICS 81, no. 5 (September 2016): T197–T209. http://dx.doi.org/10.1190/geo2015-0711.1.

Full text
Abstract:
Many practical seismic applications, such as reverse time migration (RTM) and full-waveform inversion (FWI), are computation and memory intensive. To perform crosscorrelation in RTM or build the gradient for FWI, the forward and adjoint wavefields must be accessed simultaneously. There are three methods to do this: reading the stored forward wavefield from disk; reverse propagation from the final snapshot and the stored boundaries; and remodeling using checkpointing from one stored state to another. Among these techniques, wavefield reconstruction by reverse propagation appears to be quite straightforward; however, it suffers a stringent memory bottleneck for 3D large-scale imaging applications. The Courant-Friedrichs-Lewy (CFL) condition is the fundamental criterion determining the temporal sampling for stable wavefield extrapolation. The injection of the boundary sequence in time, however, is governed by the Nyquist sampling principle rather than by the time interval given by CFL, which is much smaller than the Nyquist requirement. Based on this recognition, we developed three boundary interpolation techniques, namely discrete Fourier transform (DFT) interpolation, Kaiser windowed sinc interpolation, and Lagrange polynomial interpolation, for wavefield reconstruction, moving from the CFL to the Nyquist limit. Wavefield reconstruction via DFT interpolation can be implemented with folding and unfolding steps during forward simulation and backward reconstruction on the fly. Compared with DFT interpolation, the reconstruction methods using Kaiser windowed sinc interpolation and Lagrange polynomial interpolation are more efficient while retaining competitive accuracy. These methods allow us to dramatically decimate the boundary without significant loss of information, nicely reconstructing the boundary elements between the samples and making in-core memory storage of the boundaries feasible in 3D large-scale imaging applications.
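As a sketch of the second interpolant, Kaiser-windowed sinc interpolation of a decimated boundary sequence can be written in a few lines of NumPy; the tap count, window parameter, and kernel normalization here are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def kaiser_sinc_interp(decimated, ratio, half_width=4, beta=6.0):
    """Reconstruct samples between stored (Nyquist-rate) boundary values
    with a Kaiser-windowed sinc kernel."""
    n_out = (len(decimated) - 1) * ratio + 1
    out = np.zeros(n_out)
    taps = np.arange(-half_width, half_width + 1)
    window = np.kaiser(2 * half_width + 1, beta)
    for j in range(n_out):
        pos = j / ratio                      # position in decimated units
        k = int(np.floor(pos))
        frac = pos - k
        idx = np.clip(k + taps, 0, len(decimated) - 1)  # clamp at the ends
        kernel = np.sinc(taps - frac) * window
        out[j] = decimated[idx] @ (kernel / kernel.sum())
    return out

t = np.linspace(0, 1, 9)                     # coarse "boundary" samples
fine = kaiser_sinc_interp(np.sin(2 * np.pi * t), ratio=8)
```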
APA, Harvard, Vancouver, ISO, and other styles
39

Javed, Hassan, Muhammad Bilal, and Shahid Masud. "A Hardware–Software Co-Design Framework for Real-Time Video Stabilization." Journal of Circuits, Systems and Computers 29, no. 02 (May 3, 2019): 2050027. http://dx.doi.org/10.1142/s0218126620500279.

Full text
Abstract:
Live digital video is a valuable source of information in security, broadcast, and industrial quality control applications. Motion jitter due to camera and platform instability is a common artefact in captured video that renders it less effective for subsequent computer vision tasks such as detection and tracking of objects, background modeling, and mosaicking. Algorithmically compensating for motion jitter is hence a mandatory pre-processing step in many applications. This process, called video stabilization, requires estimating global motion from consecutive video frames and is constrained by additional challenges such as preserving intentional motion and the native frame resolution. The problem is exacerbated in the presence of local motion of foreground objects, which requires robust compensation. As such, achieving real-time performance for this computationally intensive operation is a difficult task for embedded processors with limited computational and memory resources. In this work, the development of an optimized hardware–software co-design framework for video stabilization has been investigated. Efficient video stabilization depends on the identification of key points in the frame, which in turn requires dense feature calculation at the pixel level. This task has been identified as most suitable for offloading to pipelined hardware implemented in the FPGA fabric, owing to the complex memory and computation operations involved. The subsequent steps of the stabilization algorithm operate on these sparse key points and are handled efficiently in software. The proposed Hardware–Software (HW–SW) co-design framework has been implemented on the Zedboard FPGA platform, which houses a Xilinx Zynq SoC equipped with an ARM A9 processor. The proposed implementation processes a real-time video stream at 28 frames per second and is at least twice as fast as the corresponding software-only approach. Two different hardware accelerator designs have been implemented with different high-level synthesis tools following rapid prototyping principles; they consume less than 50% of the logic resources available on the host FPGA while being at least 30% faster than contemporary designs.
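The software side of such a pipeline typically fits a global motion model to the sparse key points and smooths the resulting camera path; a minimal Python sketch of that idea, using a similarity-model least-squares fit and a moving-average smoother (both assumptions rather than the paper's exact formulation), is:

```python
import numpy as np

def estimate_global_motion(pts_prev, pts_curr):
    """Least-squares similarity fit (translation + rotation/scale) from
    sparse key-point correspondences, the step that consumes the key
    points produced by the hardware feature-extraction pipeline."""
    A, b = [], []
    for (x, y), (u, v) in zip(pts_prev, pts_curr):
        A += [[x, -y, 1, 0], [y, x, 0, 1]]   # u = a*x - s*y + tx, v = s*x + a*y + ty
        b += [u, v]
    (a, s, tx, ty), *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return tx, ty, np.arctan2(s, a)          # translation and rotation

def smooth(trajectory, radius=15):
    """Moving-average smoothing: the gap between raw and smoothed camera
    paths is the jitter to cancel, so slow intentional motion survives."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    return np.convolve(np.pad(trajectory, radius, mode="edge"), kernel, mode="valid")

pts_prev = [(0, 0), (100, 0), (0, 100), (100, 100)]
pts_curr = [(2, 3), (102, 3), (2, 103), (102, 103)]   # pure translation
print(estimate_global_motion(pts_prev, pts_curr))      # (~2, ~3, ~0 rad)
```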
APA, Harvard, Vancouver, ISO, and other styles
40

Zeng, Xiaogang, and Fang Zhao. "Integral Equation Method via Domain Decomposition and Collocation for Scattering Problems." Journal of Applied Mechanics 62, no. 1 (March 1, 1995): 186–92. http://dx.doi.org/10.1115/1.2895901.

Full text
Abstract:
In this paper, an exterior domain decomposition (DD) method based on the boundary element (BE) formulation for the solution of two- or three-dimensional time-harmonic scattering problems in acoustic media is described. It is known that the requirement of large memory and intensive computation has been one of the major obstacles to solving large-scale high-frequency acoustic systems using traditional nonlocal BE formulations, due to the fully populated resultant matrix generated by the BE discretization. The essence of this study is to decouple, through DD of the problem-defined exterior region, the original problem into arbitrary subproblems with data sharing only at the interfaces. By decomposing the exterior infinite domain into an appropriate number of infinite subdomains, this method not only ensures the validity of the formulation for all frequencies but also leads to a diagonalized, blockwise-banded system of discretized equations whose solution requires only O(N) multiplications, where N is the number of unknowns on the scatterer surface. The size of an individual submatrix associated with a subdomain may be selected by the user, for example so that the memory limitation of a given computer is accommodated. In addition, the method is well suited to parallel processing, since the data associated with each subdomain (impedance matrices, load vectors, etc.) may be generated in parallel, with communication needed only for the interface values. Most significantly, unlike existing boundary integral-based formulations valid for all frequencies, our method avoids the use of both hypersingular operators and double integrals, thereby reducing the computational effort. Numerical experiments have been conducted for rigid cylindrical scatterers subjected to a plane incident wave. The results demonstrate the accuracy of the method for wave numbers ranging from 0 to 30, both directly on the scatterer and in the far field, and confirm that the procedure is valid for critical frequencies.
APA, Harvard, Vancouver, ISO, and other styles
41

Dharmaraj, Christopher D., Kishan Thadikonda, Anthony R. Fletcher, Phuc N. Doan, Nallathamby Devasahayam, Shingo Matsumoto, Calvin A. Johnson, et al. "Reconstruction for Time-Domain In Vivo EPR 3D Multigradient Oximetric Imaging—A Parallel Processing Perspective." International Journal of Biomedical Imaging 2009 (2009): 1–12. http://dx.doi.org/10.1155/2009/528639.

Full text
Abstract:
Three-dimensional oximetric Electron Paramagnetic Resonance Imaging using the Single Point Imaging modality generates unpaired spin density and oxygen images that can readily distinguish between normal and tumor tissues in small animals. With fast imaging it is also possible to track the changes in tissue oxygenation in response to the oxygen content of the breathing air. However, this involves dealing with gigabytes of data for each 3D oximetric imaging experiment, with digital band-pass filtering and background noise subtraction followed by 3D Fourier reconstruction. This process is rather slow on a conventional uniprocessor system. This paper presents a parallelization framework using OpenMP runtime support and parallel MATLAB to execute such computationally intensive programs. The Intel compiler is used to develop a parallel C++ code based on OpenMP. The code is executed on four dual-core AMD Opteron shared-memory processors, reducing the computational burden of the filtration task significantly. The results show that the parallel filtration code achieved a speedup factor of 46.66 compared with the equivalent serial MATLAB code. In addition, a parallel MATLAB code has been developed to perform 3D Fourier reconstruction. Speedup factors of 4.57 and 4.25 have been achieved during the reconstruction process and oximetry computation, respectively, for a data set with 23×23×23 gradient steps. The execution time has been computed for both the serial and parallel implementations using different data dimensions and is presented for comparison. The reported system has been designed to be easily accessible even from low-cost personal computers through a local network (NIHnet). The experimental results demonstrate that parallel computing provides high computational power to obtain biophysical parameters from 3D EPR oximetric imaging almost in real time.
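The filtration stage parallelizes well because each acquisition trace is filtered independently. The paper uses OpenMP/C++ and parallel MATLAB; the Python multiprocessing sketch below, with an assumed FFT band-pass, only illustrates the same embarrassingly parallel structure:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def bandpass(trace, low=0.05, high=0.25):
    """FFT band-pass filter applied to one trace; traces are independent,
    which is what makes the filtration stage parallel."""
    spec = np.fft.rfft(trace)
    f = np.fft.rfftfreq(trace.size)          # normalized frequency axis
    spec[(f < low) | (f > high)] = 0.0
    return np.fft.irfft(spec, n=trace.size)

if __name__ == "__main__":
    traces = [np.random.randn(4096) for _ in range(1000)]
    with ProcessPoolExecutor() as pool:      # plays the role of OpenMP threads
        filtered = list(pool.map(bandpass, traces, chunksize=50))
```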
APA, Harvard, Vancouver, ISO, and other styles
42

Zhou, Chao, and Tao Zhang. "High Performance Graph Data Imputation on Multiple GPUs." Future Internet 13, no. 2 (January 31, 2021): 36. http://dx.doi.org/10.3390/fi13020036.

Full text
Abstract:
In real applications, massive data with graph structures are often incomplete due to various restrictions. Therefore, graph data imputation algorithms have been widely used in the fields of social networks, sensor networks, and MRI to solve the graph data completion problem. To keep the data relevant, the data structure is represented by a graph-tensor, in which each matrix is the vertex value of a weighted graph. The convolutional imputation algorithm has been proposed to solve the low-rank graph-tensor completion problem in which some data matrices are entirely unobserved. However, this data imputation algorithm has limited applicability because it is compute-intensive and performs poorly on CPUs. In this paper, we propose a scheme to execute the convolutional imputation algorithm with higher performance on GPUs (Graphics Processing Units) by exploiting the multiple cores of CUDA-architecture GPUs. We propose optimization strategies to achieve coalesced memory access in the graph Fourier transform (GFT) computation and to improve the utilization of GPU SM resources in the singular value decomposition (SVD) computation. Furthermore, we design a scheme that extends the GPU-optimized implementation to multiple GPUs for large-scale computing. Experimental results show that the GPU implementation is both fast and accurate. On synthetic data of varying sizes, the GPU-optimized implementation running on a single Quadro RTX6000 GPU achieves up to 60.50× speedups over the GPU-baseline implementation. The multi-GPU implementation achieves up to 1.81× speedups on two GPUs versus the GPU-optimized implementation on a single GPU. On the ego-Facebook dataset, the GPU-optimized implementation achieves up to 77.88× speedups over the GPU-baseline implementation. Meanwhile, the GPU implementation and the CPU implementation achieve similar, low recovery errors.
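The core of the convolutional imputation idea, sketched here in NumPy as a CPU stand-in for the CUDA kernels, is a graph Fourier transform along the vertex axis followed by independent per-slice SVD shrinkage; the soft-threshold step and its parameter are illustrative assumptions:

```python
import numpy as np

def gft_svd_step(T, L, tau=1.0):
    """One illustrative pass: graph Fourier transform of a graph-tensor
    T (n_vertices x rows x cols) along the vertex axis, per-slice
    singular-value soft-thresholding, then the inverse transform."""
    _, U = np.linalg.eigh(L)                  # GFT basis: Laplacian eigenvectors
    That = np.tensordot(U.T, T, axes=1)       # forward GFT over vertices
    for k in range(That.shape[0]):            # independent slices -> GPU-friendly
        u, s, vt = np.linalg.svd(That[k], full_matrices=False)
        That[k] = (u * np.maximum(s - tau, 0.0)) @ vt   # low-rank shrinkage
    return np.tensordot(U, That, axes=1)      # inverse GFT

# toy 4-vertex path graph
A = np.diag(np.ones(3), 1); A += A.T
L = np.diag(A.sum(1)) - A
T = np.random.randn(4, 6, 5)
T_completed = gft_svd_step(T, L)
```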
APA, Harvard, Vancouver, ISO, and other styles
43

Shishkin, Sergey. "TO CALCULATION OF ELLIPTIC CRUSHER WITH FORM MEMORY EFFECT." Bulletin of Bryansk state technical university 2020, no. 7 (June 28, 2020): 12–19. http://dx.doi.org/10.30987/1999-8775-2020-7-12-19.

Full text
Abstract:
The purpose of this investigation is the development of a calculation model of a thermo-mechanical power device intended for the destruction of concrete structures and hard mineral rock. Its operating principle is based on the transformation of an initially round cross-section into an oval one during the realization of the alloy's shape memory, which ensures a wedging effect. A solution is offered for the problem of defining the linear effort of a crusher acting upon well sides, depending on rock resistance and the capacity of thermo-mechanical return. The investigation method consists in modeling the power pipe as a cylindrical shell under non-axisymmetric loading, with a thermo-mechanical diagram accepted as its deformation-power analogue. The essential condition of this approach is the identity between the task of deformation restoration with samples during diagram formation and that of the power element in the structure. A fundamentally new contribution is the calculated definition of the parameters of the dependence of reactive stress on the unrestored deformation value at radial bending, according to the diagram at the specified deformation by stretching, which makes it possible to exclude a labor-intensive experiment. As a result of the investigation, formulae are obtained for defining the linear effort and for computing pipe dimensions within the limits of design calculation. An actual example of defining the power characteristics is shown. The reliability of the calculation is confirmed by the application of these power devices during the destruction of radioactive concrete structures at a nuclear power plant and during hard rock destruction at an emerald field development. Thus, the proposed thermo-mechanical crusher design and the corresponding calculation procedure can be used in practice without any changes.
APA, Harvard, Vancouver, ISO, and other styles
44

Huang, Wanrong, Xiaodong Yi, Yichun Sun, Yingwen Liu, Shuai Ye, and Hengzhu Liu. "Scalable Parallel Distributed Coprocessor System for Graph Searching Problems with Massive Data." Scientific Programming 2017 (2017): 1–9. http://dx.doi.org/10.1155/2017/1496104.

Full text
Abstract:
Internet applications such as network searching, electronic commerce, and modern medical applications produce and process massive data. Considerable data parallelism exists in the computation processes of data-intensive applications. Breadth-first search (BFS), a traversal algorithm, is fundamental to many graph processing applications and metrics as graphs grow in scale. A variety of scientific programming methods have been proposed for accelerating and parallelizing BFS because of the poor temporal and spatial locality caused by its inherently irregular memory access patterns; new parallel hardware, however, can provide further improvement for these methods. To address small-world graph problems, we propose a scalable and novel field-programmable gate array-based heterogeneous multicore system for scientific programming. Each core is multithreaded for streaming processing, and the InfiniBand communication network is adopted for scalability. We design a binary search algorithm for address mapping that unifies all processor addresses. Within the limits permitted by the Graph500 benchmark, after testing a 1D parallel hybrid BFS algorithm, our 8-core, 8-thread-per-core system achieved superior performance and efficiency compared with prior work at the same degree of parallelism. Our system is efficient not as a special acceleration unit but as a processor platform for graph searching applications.
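For reference, the level-synchronous BFS kernel that such systems accelerate can be stated in a few lines; the irregular neighbor gathers in the inner loop are the memory accesses the hardware must hide. This is a simplified Graph500-style sketch, not the authors' 1D hybrid algorithm:

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency list."""
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        next_frontier = deque()
        for u in frontier:                    # frontier expansion
            for v in adj[u]:                  # irregular neighbor gathers
                if v not in level:
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```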
APA, Harvard, Vancouver, ISO, and other styles
45

ABELSON, HAROLD, ANDREW A. BERLIN, JACOB KATZENELSON, WILLIAM H. McALLISTER, GUILLERMO J. ROZAS, GERALD JAY SUSSMAN, and JACK WISDOM. "THE SUPERCOMPUTER TOOLKIT: A GENERAL FRAMEWORK FOR SPECIAL-PURPOSE COMPUTING." International Journal of High Speed Electronics and Systems 03, no. 03n04 (September 1992): 337–61. http://dx.doi.org/10.1142/s0129156492000138.

Full text
Abstract:
The Supercomputer Toolkit is a family of hardware modules (processors, memory, interconnect, and input-output devices) and a collection of software modules (compilers, simulators, scientific libraries, and high-level front ends) from which high-performance special-purpose computers can be easily configured and programmed. Although there are many examples of special-purpose computers (see Ref. 4), the Toolkit approach is different in that our aim is to construct these machines from standard, reusable parts. These are combined by means of a user-reconfigurable, static interconnect technology. The Toolkit’s software support, based on novel compilation techniques, produces extremely high-performance numerical code from high-level language input. We have completed fabrication of the Toolkit processor module, and several critical software modules. An eight-processor configuration is running at MIT. We have used the prototype Toolkit to perform a breakthrough computation of scientific importance—an integration of the motion of the Solar System that extends previous results by nearly two orders of magnitude. While the Toolkit project is not complete, we believe our results show evidence that generating special-purpose computers from standard modules can be an important method of performing intensive scientific computing. This paper briefly describes the Toolkit’s hardware and software modules, the Solar System simulation, conclusions and future plans.
APA, Harvard, Vancouver, ISO, and other styles
46

Tian, Ling, Yu Cao, Bokun He, Yifan Zhang, Chu He, and Deshi Li. "Image Enhancement Driven by Object Characteristics and Dense Feature Reuse Network for Ship Target Detection in Remote Sensing Imagery." Remote Sensing 13, no. 7 (March 31, 2021): 1327. http://dx.doi.org/10.3390/rs13071327.

Full text
Abstract:
As the application scenarios of remote sensing imagery (RSI) become richer, ship detection from an overhead perspective is of great significance. Compared with traditional methods, deep learning approaches are more promising. However, Convolutional Neural Networks (CNNs) have poor resistance to sample differences in detection tasks, and the huge differences in image environment, background, and quality of RSIs degrade detection performance; moreover, upsampling and pooling operations cause loss of detailed information in the features, and CNNs with outstanding results are often accompanied by high computation cost and large memory storage. Considering the characteristics of ship targets in RSIs, this study proposes a detection framework combining an image enhancement module with a dense feature reuse module: (1) drawing on the ideas of the generative adversarial network (GAN), we designed an image enhancement module driven by object characteristics, which improves the quality of the ship targets in the images while augmenting the training set; (2) a dense feature extraction module was designed to integrate low-level location information and high-level semantic information at different resolutions while minimizing computation, improving the efficiency of feature reuse in the network; (3) we introduced a receptive field expansion module to obtain a wider range of deep semantic information and to enhance the ability to extract features of targets at different sizes. Experiments were carried out on two types of ship datasets, optical RSI and Synthetic Aperture Radar (SAR) images. The proposed framework was implemented on classic detection networks such as You Only Look Once (YOLO) and Mask-RCNN. The experimental results verify the effectiveness of the proposed method.
APA, Harvard, Vancouver, ISO, and other styles
47

Masum Refat, Chowdhury Mohammad, and Norsinnira Zainul Azlan. "Stretch Sensor-Based Facial Expression Recognition and Classification Using Machine Learning." International Journal of Computational Intelligence and Applications 20, no. 02 (April 6, 2021): 2150010. http://dx.doi.org/10.1142/s1469026821500103.

Full text
Abstract:
Sensor-based facial expression recognition (FER) is an attractive research topic. Nowadays, FER is used in different applications such as smart environments and healthcare solutions. Machines can learn human emotion using FER technology, which is primary and essential for the quantitative analysis of human sentiment. Camera-based FER is an image recognition problem within the broader field of computer vision, where face detection and tracking and reliable face recognition still present considerable challenges. First, data processing and analytics are intensive and require large computation and memory resources. Second, a fundamental technical limitation is robustness to changes in the environment. Finally, illumination variation further complicates the design of robust algorithms because of changes in cast shadows. Sensor-based FER, however, overcomes these limitations. Sensor technologies, especially low-power wireless communication, high-capacity storage, and data processing, have made substantial progress, making it possible for sensors to evolve from low-level data collection and transmission to high-level inference. This study aims to develop a stretchable sensor-based FER system. A random forest machine learning algorithm is used to train the FER model, and a commercial stretchable facial expression dataset is processed in the Anaconda software environment. In this research, our stretch-sensor FER model obtained around 95% accuracy for four different emotions (Neutral, Happy, Sad, and Disgust).
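A minimal sketch of this training pipeline with scikit-learn, using synthetic stand-in data, looks as follows; the channel count, dataset shape, and labels are assumptions, and random features give only chance accuracy, so the point is the pipeline, not the score:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical stand-in for the stretch-sensor dataset: each row is one
# reading from a set of stretch sensors, labelled with one of the four
# emotions the paper classifies.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 8))                # 8 sensor channels (assumed)
y = rng.integers(0, 4, size=800)             # Neutral/Happy/Sad/Disgust

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```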
APA, Harvard, Vancouver, ISO, and other styles
48

Liu, Zhiqiang, Paul Chow, Jinwei Xu, Jingfei Jiang, Yong Dou, and Jie Zhou. "A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs." Electronics 8, no. 1 (January 7, 2019): 65. http://dx.doi.org/10.3390/electronics8010065.

Full text
Abstract:
Three-dimensional convolutional neural networks (3D CNNs) have gained popularity in many complicated computer vision applications. Many customized FPGA-based accelerators have been proposed for 2D CNNs, while very few target 3D CNNs. 3D CNNs are far more computationally intensive, and the design space for 3D CNN acceleration is further expanded by the extra dimension, making it a big challenge to accelerate 3D CNNs on FPGAs. Motivated by the finding that the computation patterns of 2D and 3D CNNs are very similar, we propose a uniform architecture design for accelerating both 2D and 3D CNNs in this paper. The uniform architecture is based on the idea of mapping convolutions to matrix multiplications. A customized mapping module is developed to generate the feature-matrix tilings without storing the entire enlarged feature matrix on-chip or off-chip, a splitting strategy is adopted to reconstruct a convolutional layer to fit the on-chip memory capacity, and a 2D multiply-and-accumulate (MAC) array computes the matrix multiplications efficiently. For demonstration, we implement an accelerator prototype with a high-level synthesis (HLS) methodology on a Xilinx VC709 board and test the accelerator on three typical CNN models: AlexNet, VGG16, and C3D. Experimental results show that the accelerator achieves state-of-the-art throughput performance on both 2D and 3D CNNs, with much better energy efficiency than the CPU and GPU.
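The mapping of convolutions to matrix multiplications is the familiar im2col construction; a NumPy sketch for the 2D case follows (a 3D variant adds a depth loop when building the feature matrix; unlike the accelerator's mapping module, which generates such tilings on the fly, this sketch materializes the whole matrix for clarity):

```python
import numpy as np

def im2col_conv2d(x, w):
    """Map a 2D convolution (stride 1, no padding) to one matrix multiply."""
    C, H, W = x.shape
    K, _, R, S = w.shape                      # K filters of size C x R x S
    Ho, Wo = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, Ho * Wo))
    for i in range(Ho):                       # each column is one flattened
        for j in range(Wo):                   # receptive-field patch
            cols[:, i * Wo + j] = x[:, i:i+R, j:j+S].ravel()
    out = w.reshape(K, -1) @ cols             # the single matrix multiplication
    return out.reshape(K, Ho, Wo)

x = np.random.randn(3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
y = im2col_conv2d(x, w)                       # equals a direct convolution
```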
APA, Harvard, Vancouver, ISO, and other styles
49

Chen, Qinyu, Yuxiang Fu, Wenqing Song, Kaifeng Cheng, Zhonghai Lu, Chuan Zhang, and Li Li. "An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks." Electronics 8, no. 4 (March 27, 2019): 371. http://dx.doi.org/10.3390/electronics8040371.

Full text
Abstract:
Convolutional Neural Networks (CNNs) have been widely applied in various fields such as image recognition and speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs have been proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse-grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which reduces the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with a low-power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture well. The accelerator is implemented in TSMC 40 nm technology with a core size of 0.17 mm². It achieves 7.03 TOPS/W energy efficiency and 4.14 TOPS/mm² area efficiency at 100.1 mW, which makes it a promising design for embedded devices.
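A minimal sketch of the kind of uniform low bit-width quantization such accelerators operate on is shown below; the paper's exact quantification scheme is not reproduced, and the bit width and scaling rule here are illustrative:

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Uniform symmetric quantization to signed low bit-width codes."""
    qmax = 2**(bits - 1) - 1                  # e.g. 7 for 4-bit signed codes
    scale = np.abs(x).max() / qmax            # illustrative scaling rule
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale                           # integer codes + dequant scale

x = np.random.randn(16).astype(np.float32)
q, s = quantize_uniform(x, bits=4)
x_hat = q.astype(np.float32) * s              # dequantized approximation
print(np.max(np.abs(x - x_hat)) <= s / 2 + 1e-6)   # error bounded by scale/2
```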
APA, Harvard, Vancouver, ISO, and other styles
50

Fazli, Saeid, and Lindsay Kleeman. "Sensor design and signal processing for an advanced sonar ring." Robotica 24, no. 4 (December 6, 2005): 433–46. http://dx.doi.org/10.1017/s0263574705002432.

Full text
Abstract:
A conventional sonar ring measures the range to objects based on the first echo and is widely used in indoor mobile robots. In contrast, advanced sonar sensing can produce accurate range and bearing (incidence angle) measurements to multiple targets using multiple receivers and multiple echoes per receiver, at the expense of intensive computation. This paper presents an advanced sonar ring that employs a low receiver sample rate to achieve processing of 48 receiver channels at near-real-time repetition rates of 11.5 Hz. The sonar ring senses 360 degrees around the robot for specular targets at ranges up to six metres, with all 24 transmitters firing simultaneously. Digital Signal Processing (DSP) techniques and interference rejection ideas are applied in this sensor to produce a fast and accurate sonar ring. Seven custom-designed DSP boards process the receivers, sampled at 250 kHz, to maximize processing speed and to limit memory requirements. This paper presents the new sensor design, the hardware structure, the software architecture, and the signal processing of the advanced sonar ring. Repeatability and accuracy of the measurements are tested to characterize the proposed sensor. Due to the low sample rate of 250 kHz, a problem called cycle hopping can occur. The paper presents a solution to cycle hopping and a new transmit coding based on pulse duration to differentiate neighbouring transmitters in the ring. Experimental data show the effectiveness of the designed sensor in indoor environments.
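A simplified far-field sketch of how range and bearing follow from echo arrival times at two receivers flanking a transmitter is given below; the geometry, baseline, and timing values are illustrative, not the ring's actual layout:

```python
import numpy as np

def range_bearing(t1, t2, baseline=0.02, c=343.0):
    """Range and bearing of one echo from its arrival times at two
    receivers flanking a transmitter (far-field approximation)."""
    rng = c * (t1 + t2) / 4.0                 # mean two-way flight time -> range
    dt = t2 - t1                              # inter-receiver delay
    bearing = np.arcsin(np.clip(c * dt / baseline, -1.0, 1.0))
    return rng, np.degrees(bearing)

# a timing error of one ultrasonic carrier cycle in the echo correlation
# shifts dt by tens of microseconds and so biases the bearing -- the
# "cycle hopping" effect that a low 250 kHz sample rate makes more likely
print(range_bearing(5.80e-3, 5.83e-3))        # ~1 m range, oblique bearing
```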
APA, Harvard, Vancouver, ISO, and other styles