
Journal articles on the topic "Neural network accelerator"



Consult the top 50 journal articles for your research on the topic "Neural network accelerator".


You can also download the full text of each academic publication in PDF format and read its abstract online whenever it is available in the metadata.


1

Eliahu, Adi, Ronny Ronen, Pierre-Emmanuel Gaillardon, and Shahar Kvatinsky. "multiPULPly." ACM Journal on Emerging Technologies in Computing Systems 17, no. 2 (2021): 1–27. http://dx.doi.org/10.1145/3432815.

Abstract:
Computationally intensive neural network applications often need to run on resource-limited low-power devices. Numerous hardware accelerators have been developed to speed up the performance of neural network applications and reduce power consumption; however, most focus on data centers and full-fledged systems. Acceleration in ultra-low-power systems has been only partially addressed. In this article, we present multiPULPly, an accelerator that integrates memristive technologies within standard low-power CMOS technology, to accelerate multiplication in neural network inference on ultra-low-power systems. This accelerator was designated for PULP, an open-source microcontroller system that uses low-power RISC-V processors. Memristors were integrated into the accelerator to enable power consumption only when the memory is active, to continue the task with no context-restoring overhead, and to enable highly parallel analog multiplication. To reduce the energy consumption, we propose novel dataflows that handle common multiplication scenarios and are tailored for our architecture. The accelerator was tested on FPGA and achieved a peak energy efficiency of 19.5 TOPS/W, outperforming state-of-the-art accelerators by 1.5× to 4.5×.
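The paper does not include code, but as a rough, generic illustration of the highly parallel analog multiplication a memristive crossbar provides, the following Python sketch models a matrix-vector product as conductances times applied voltages (the array sizes and value ranges are hypothetical, not taken from multiPULPly):

    import numpy as np

    # Idealized memristive crossbar model: weights are stored as conductances G (siemens),
    # inputs are applied as voltages v, and each column current is the analog accumulation
    # I_j = sum_i v_i * G[i, j] (Kirchhoff's current law performs the summation).
    def crossbar_mvm(G, v):
        return G.T @ v

    rng = np.random.default_rng(0)
    G = rng.uniform(1e-6, 1e-4, size=(128, 64))   # 128x64 crossbar of conductances
    v = rng.uniform(0.0, 0.2, size=128)           # read voltages, one per row
    currents = crossbar_mvm(G, v)                 # 64 column currents, produced in parallel
    print(currents.shape)                         # (64,)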
2

Hong, JiUn, Saad Arslan, TaeGeon Lee, and HyungWon Kim. "Design of Power-Efficient Training Accelerator for Convolution Neural Networks." Electronics 10, no. 7 (2021): 787. http://dx.doi.org/10.3390/electronics10070787.

Abstract:
To realize deep learning techniques, a type of deep neural network (DNN) called a convolutional neural network (CNN) is among the most widely used models aimed at image recognition applications. However, there is growing demand for lightweight and low-power neural network accelerators, not only for inference but also for the training process. In this paper, we propose a training accelerator that provides low power and compact chip size, targeted at mobile and edge computing applications. It achieves real-time processing of both inference and training using concurrent floating-point data paths. The proposed accelerator can be externally controlled and employs resource sharing and an integrated convolution-pooling block to achieve low area and low energy consumption. We implemented the proposed training accelerator in an FPGA (Field Programmable Gate Array) and evaluated its training performance using an MNIST CNN example in comparison with a PC with a GPU (Graphics Processing Unit). While both methods achieved a similar training accuracy of 95.1%, the proposed accelerator, when implemented in a silicon chip, reduced the energy consumption by 480 times compared to the counterpart. Additionally, when implemented on an FPGA, an energy reduction of over 4.5 times was achieved compared to the existing FPGA training accelerator for the MNIST dataset. Therefore, the proposed accelerator is more suitable for deployment in mobile/edge nodes compared to the existing software and hardware accelerators.
3

Cho, Jaechan, Yongchul Jung, Seongjoo Lee, and Yunho Jung. "Reconfigurable Binary Neural Network Accelerator with Adaptive Parallelism Scheme." Electronics 10, no. 3 (2021): 230. http://dx.doi.org/10.3390/electronics10030230.

Abstract:
Binary neural networks (BNNs) have attracted significant interest for the implementation of deep neural networks (DNNs) on resource-constrained edge devices, and various BNN accelerator architectures have been proposed to achieve higher efficiency. BNN accelerators can be divided into two categories: streaming and layer accelerators. Although streaming accelerators designed for a specific BNN network topology provide high throughput, they are infeasible for various sensor applications in edge AI because of their complexity and inflexibility. In contrast, layer accelerators with reasonable resources can support various network topologies, but they operate with the same parallelism for all the layers of the BNN, which degrades throughput performance at certain layers. To overcome this problem, we propose a BNN accelerator with adaptive parallelism that offers high throughput performance in all layers. The proposed accelerator analyzes target layer parameters and operates with optimal parallelism using reasonable resources. In addition, this architecture is able to fully compute all types of BNN layers thanks to its reconfigurability, and it can achieve a higher area–speed efficiency than existing accelerators. In performance evaluation using state-of-the-art BNN topologies, the designed BNN accelerator achieved an area–speed efficiency 9.69 times higher than previous FPGA implementations and 24% higher than existing VLSI implementations for BNNs.
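For readers unfamiliar with BNN arithmetic, the core operation that layer accelerators like this one implement is the XNOR-popcount dot product between +1/-1 activations and weights; the sketch below is a generic bit-level illustration (not the authors' hardware):

    import numpy as np

    def binary_dot(a_bits, w_bits, n):
        # a_bits, w_bits: integers whose lowest n bits encode +1 (bit=1) / -1 (bit=0).
        # XNOR marks the positions where signs agree; popcount counts them.
        xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)
        agree = bin(xnor).count("1")
        return 2 * agree - n                      # dot product over {-1, +1}

    rng = np.random.default_rng(1)
    a = rng.integers(0, 2, 64)                    # 0/1 encoding of -1/+1 activations
    w = rng.integers(0, 2, 64)
    a_bits = int("".join(map(str, a)), 2)
    w_bits = int("".join(map(str, w)), 2)
    reference = int(np.dot(2 * a - 1, 2 * w - 1)) # same product over the +1/-1 values
    assert binary_dot(a_bits, w_bits, 64) == reference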
4

Noskova, E. S., I. E. Zakharov, Y. N. Shkandybin, and S. G. Rykovanov. "Towards energy-efficient neural network calculations." Computer Optics 46, no. 1 (2022): 160–66. http://dx.doi.org/10.18287/2412-6179-co-914.

Abstract:
Nowadays, the problem of creating high-performance and energy-efficient hardware for Artificial Intelligence tasks is very acute. The most popular solution to this problem is the use of Deep Learning Accelerators, such as GPUs and Tensor Processing Units, to run neural networks. Recently, NVIDIA announced the NVDLA project, which allows one to design neural network accelerators based on open-source code. This work describes a full cycle of creating a prototype NVDLA accelerator, as well as testing the resulting solution by running the ResNet-50 neural network on it. Finally, an assessment of the performance and power efficiency of the prototype NVDLA accelerator compared to a GPU and a CPU is provided, and the results show the superiority of NVDLA in many characteristics.
5

Fan, Yuxiao. "Design and research of high-performance convolutional neural network accelerator based on Chipyard." Journal of Physics: Conference Series 2858, no. 1 (2024): 012001. http://dx.doi.org/10.1088/1742-6596/2858/1/012001.

Abstract:
Neural network accelerators perform well in the research and verification of neural network models. In this paper, a convolutional neural network accelerator system composed of a RISC-V processor core and a Gemmini array accelerator is designed in the Chisel language within the Chipyard framework, and the acceleration effect of different Gemmini array configurations for different input matrices is further investigated. The results show that the accelerator system can achieve a speedup of several thousand times over a single processor for large matrix calculations.
6

Xu, Jia, Han Pu, and Dong Wang. "Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection." Micromachines 16, no. 1 (2024): 22. https://doi.org/10.3390/mi16010022.

Abstract:
Reconfigurable processor-based acceleration of deep convolutional neural network (DCNN) algorithms has emerged as a widely adopted technique, with particular attention on sparse neural network acceleration as an active research area. However, many computing devices that claim high computational power still struggle to execute neural network algorithms with optimal efficiency, low latency, and minimal power consumption. Consequently, there remains significant potential for further exploration into improving the efficiency, latency, and power consumption of neural network accelerators across diverse computational scenarios. This paper investigates three key techniques for hardware acceleration of sparse neural networks. The main contributions are as follows: (1) Most neural network inference tasks are typically executed on general-purpose computing devices, which often fail to deliver high energy efficiency and are not well-suited for accelerating sparse convolutional models. In this work, we propose a specialized computational circuit for the convolutional operations of sparse neural networks. This circuit is designed to detect and eliminate the computational effort associated with zero values in the sparse convolutional kernels, thereby enhancing energy efficiency. (2) The data access patterns in convolutional neural networks introduce significant pressure on the high-latency off-chip memory access process. Due to issues such as data discontinuity, the data reading unit often fails to fully exploit the available bandwidth during off-chip read and write operations. In this paper, we analyze bandwidth utilization in the context of convolutional accelerator data handling and propose a strategy to improve off-chip access efficiency. Specifically, we leverage a compiler optimization plugin developed for Vitis HLS, which automatically identifies and optimizes on-chip bandwidth utilization. (3) In coefficient-based accelerators, the synchronous operation of individual computational units can significantly hinder efficiency. Previous approaches have achieved asynchronous convolution by designing separate memory units for each computational unit; however, this method consumes a substantial amount of on-chip memory resources. To address this issue, we propose a shared feature map cache design for asynchronous convolution in the accelerators presented in this paper. This design resolves address access conflicts when multiple computational units concurrently access a set of caches by utilizing a hash-based address indexing algorithm. Moreover, the shared cache architecture reduces data redundancy and conserves on-chip resources. Using the optimized accelerator, we successfully executed ResNet50 inference on an Intel Arria 10 1150GX FPGA, achieving a throughput of 497 GOPS, or an equivalent computational power of 1579 GOPS, with a power consumption of only 22 watts.
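To make the zero-skipping idea in contribution (1) concrete, here is a minimal generic sketch that iterates only over the non-zero taps of a sparse kernel; it illustrates the general technique, not the paper's circuit or its hash-based shared cache:

    import numpy as np

    def sparse_conv2d_valid(x, k):
        # 'valid' 2D convolution that skips multiplications by zero-valued weights.
        kh, kw = k.shape
        oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        # Precompute the coordinates of non-zero weights once per kernel.
        nz = [(i, j, k[i, j]) for i in range(kh) for j in range(kw) if k[i, j] != 0]
        y = np.zeros((oh, ow))
        for r in range(oh):
            for c in range(ow):
                acc = 0.0
                for i, j, w in nz:            # only non-zero taps contribute MACs
                    acc += w * x[r + i, c + j]
                y[r, c] = acc
        return y

    x = np.arange(36, dtype=float).reshape(6, 6)
    k = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [-1.0, 0.0, 0.0]])          # 7 of 9 weights pruned away
    print(sparse_conv2d_valid(x, k))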
7

Ferianc, Martin, Hongxiang Fan, Divyansh Manocha, et al. "Improving Performance Estimation for Design Space Exploration for Convolutional Neural Network Accelerators." Electronics 10, no. 4 (2021): 520. http://dx.doi.org/10.3390/electronics10040520.

Abstract:
Contemporary advances in neural networks (NNs) have demonstrated their potential in different applications such as in image classification, object detection or natural language processing. In particular, reconfigurable accelerators have been widely used for the acceleration of NNs due to their reconfigurability and efficiency in specific application instances. To determine the configuration of the accelerator, it is necessary to conduct design space exploration to optimize the performance. However, the process of design space exploration is time consuming because of the slow performance evaluation for different configurations. Therefore, there is a demand for an accurate and fast performance prediction method to speed up design space exploration. This work introduces a novel method for fast and accurate estimation of different metrics that are of importance when performing design space exploration. The method is based on a Gaussian process regression model parametrised by the features of the accelerator and the target NN to be accelerated. We evaluate the proposed method together with other popular machine learning based methods in estimating the latency and energy consumption of our implemented accelerator on two different hardware platforms targeting convolutional neural networks. We demonstrate improvements in estimation accuracy, without the need for significant implementation effort or tuning.
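As a minimal sketch of the kind of Gaussian process surrogate described here, assuming scikit-learn and entirely made-up accelerator features and latency measurements, one could fit and query a predictor as follows:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    # Toy design points: [parallelism, buffer_kB, clock_MHz, log10(layer MACs)].
    X = np.array([
        [16, 128, 150, 7.2],
        [32, 256, 200, 7.2],
        [64, 256, 200, 8.1],
        [32, 512, 250, 8.1],
        [64, 512, 250, 8.9],
    ])
    y = np.array([12.4, 7.1, 4.8, 5.9, 3.2])   # measured latency in ms (illustrative)

    gpr = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
    gpr.fit(X, y)

    candidate = np.array([[48, 384, 225, 8.1]])
    mean, std = gpr.predict(candidate, return_std=True)
    print(f"predicted latency {mean[0]:.2f} ms +/- {std[0]:.2f}")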
8

Sunny, Febin P., Asif Mirza, Mahdi Nikdast, and Sudeep Pasricha. "ROBIN: A Robust Optical Binary Neural Network Accelerator." ACM Transactions on Embedded Computing Systems 20, no. 5s (2021): 1–24. http://dx.doi.org/10.1145/3476988.

Abstract:
Domain specific neural network accelerators have garnered attention because of their improved energy efficiency and inference performance compared to CPUs and GPUs. Such accelerators are thus well suited for resource-constrained embedded systems. However, mapping sophisticated neural network models on these accelerators still entails significant energy and memory consumption, along with high inference time overhead. Binarized neural networks (BNNs), which utilize single-bit weights, represent an efficient way to implement and deploy neural network models on accelerators. In this paper, we present a novel optical-domain BNN accelerator, named ROBIN, which intelligently integrates heterogeneous microring resonator optical devices with complementary capabilities to efficiently implement the key functionalities in BNNs. We perform detailed fabrication-process variation analyses at the optical device level, explore efficient corrective tuning for these devices, and integrate circuit-level optimization to counter thermal variations. As a result, our proposed ROBIN architecture possesses the desirable traits of being robust, energy-efficient, low latency, and high throughput, when executing BNN models. Our analysis shows that ROBIN can outperform the best-known optical BNN accelerators and many electronic accelerators. Specifically, our energy-efficient ROBIN design exhibits energy-per-bit values that are ∼4× lower than electronic BNN accelerators and ∼933× lower than a recently proposed photonic BNN accelerator, while a performance-efficient ROBIN design shows ∼3× and ∼25× better performance than electronic and photonic BNN accelerators, respectively.
9

Tang, Wenkai, and Peiyong Zhang. "GPGCN: A General-Purpose Graph Convolution Neural Network Accelerator Based on RISC-V ISA Extension." Electronics 11, no. 22 (2022): 3833. http://dx.doi.org/10.3390/electronics11223833.

Abstract:
In the past two years, various graph convolution neural network (GCN) accelerators have emerged, each with their own characteristics, but their common disadvantage is that the hardware architecture is not programmable and is optimized for a specific network and dataset. They may not support acceleration for different GCNs and may not achieve optimal hardware resource utilization for datasets of different sizes. Therefore, given the above shortcomings, and following the development trend of traditional neural network accelerators, this paper proposes and implements GPGCN: a general-purpose GCN accelerator architecture based on a RISC-V instruction set extension, providing the software programming freedom to support acceleration of various GCNs and achieving the best acceleration efficiency for different GCNs with different datasets. Compared with a traditional CPU, and a traditional CPU with vector extension, GPGCN achieves above 1001× and 267× speedups, respectively, for GCN with the Cora dataset. Compared with dedicated accelerators, GPGCN has software programmability and supports the acceleration of more GCNs.
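For context, the per-layer propagation rule that GCN accelerators compute is H' = σ(D^(-1/2) (A + I) D^(-1/2) H W); the small numpy sketch below is a generic software reference for that rule, unrelated to GPGCN's RISC-V instructions:

    import numpy as np

    def gcn_layer(A, H, W):
        # One graph-convolution layer: ReLU(D^-1/2 (A+I) D^-1/2 H W).
        A_hat = A + np.eye(A.shape[0])              # add self-loops
        d = A_hat.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
        return np.maximum(A_norm @ H @ W, 0.0)      # aggregate, transform, ReLU

    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # tiny 3-node graph
    H = np.random.default_rng(2).normal(size=(3, 4))              # node features
    W = np.random.default_rng(3).normal(size=(4, 2))              # layer weights
    print(gcn_layer(A, H, W))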
10

Xia, Chengpeng, Yawen Chen, Haibo Zhang, Hao Zhang, Fei Dai, and Jigang Wu. "Efficient neural network accelerators with optical computing and communication." Computer Science and Information Systems, no. 00 (2022): 66. http://dx.doi.org/10.2298/csis220131066x.

Abstract:
Conventional electronic Artificial Neural Network (ANN) accelerators focus on architecture design and numerical computation optimization to improve training efficiency. However, these approaches have recently encountered bottlenecks in terms of energy efficiency and computing performance, which has led to increased interest in photonic accelerators. Photonic architectures, with their low energy consumption, high transmission speed and high bandwidth, are considered to play an important role in the next generation of computing architectures. In this paper, to provide a better understanding of the optical technology used in ANN acceleration, we present a comprehensive review of efficient photonic computing and communication in ANN accelerators. The related photonic devices are investigated in terms of their application in ANN acceleration, and a classification of existing solutions is proposed, categorizing them into optical computing acceleration and optical communication acceleration according to photonic effects and photonic architectures. Moreover, we discuss the challenges of these photonic neural network acceleration approaches to highlight the most promising future research opportunities in this field.
11

Anmin, Kong, and Zhao Bin. "A Parallel Loading Based Accelerator for Convolution Neural Network." International Journal of Machine Learning and Computing 10, no. 5 (2020): 669–74. http://dx.doi.org/10.18178/ijmlc.2020.10.5.989.

12

An, Fubang, Lingli Wang, and Xuegong Zhou. "A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network." Electronics 12, no. 13 (2023): 2847. http://dx.doi.org/10.3390/electronics12132847.

Abstract:
Since the lightweight convolutional neural network EfficientNet was proposed by Google in 2019, the series of models have quickly become very popular due to their superior performance with a small number of parameters. However, the existing convolutional neural network hardware accelerators for EfficientNet still have much room to improve the performance of the depthwise convolution, squeeze-and-excitation module and nonlinear activation functions. In this paper, we first design a reconfigurable register array and computational kernel to accelerate the depthwise convolution. Next, we propose a vector unit to implement the nonlinear activation functions and the scale operation. An exchangeable-sequence dual-computational kernel architecture is proposed to improve the performance and the utilization. In addition, the memory architectures are designed to complete the hardware accelerator for the above computing architecture. Finally, in order to evaluate the performance of the hardware accelerator, the accelerator is implemented based on Xilinx XCVU37P. The results show that the proposed accelerator can work at the main system clock frequency of 300 MHz with the DSP kernel at 600 MHz. The performance of EfficientNet-B3 in our architecture can reach 69.50 FPS and 255.22 GOPS. Compared with the latest EfficientNet-B3 accelerator, which uses the same FPGA development board, the accelerator proposed in this paper can achieve a 1.28-fold improvement of single-core performance and 1.38-fold improvement of performance of each DSP.
13

Biookaghazadeh, Saman, Pravin Kumar Ravi, and Ming Zhao. "Toward Multi-FPGA Acceleration of the Neural Networks." ACM Journal on Emerging Technologies in Computing Systems 17, no. 2 (2021): 1–23. http://dx.doi.org/10.1145/3432816.

Abstract:
High-throughput and low-latency Convolutional Neural Network (CNN) inference is increasingly important for many cloud- and edge-computing applications. FPGA-based acceleration of CNN inference has demonstrated various benefits compared to other high-performance devices such as GPGPUs. Current FPGA CNN-acceleration solutions are based on a single FPGA design, which are limited by the available resources on an FPGA. In addition, they can only accelerate conventional 2D neural networks. To address these limitations, we present a generic multi-FPGA solution, written in OpenCL, which can accelerate more complex CNNs (e.g., C3D CNN) and achieve a near linear speedup with respect to the available single-FPGA solutions. The design is built upon the Intel Deep Learning Accelerator architecture, with three extensions. First, it includes updates for better area efficiency (up to 25%) and higher performance (up to 24%). Second, it supports 3D convolutions for more challenging applications such as video learning. Third, it supports multi-FPGA communication for higher inference throughput. The results show that utilizing multiple FPGAs can linearly increase the overall bandwidth while maintaining the same end-to-end latency. In addition, the design can outperform other FPGA 2D accelerators by up to 8.4 times and 3D accelerators by up to 1.7 times.
14

Chen, Weijian, Zhi Qi, Zahid Akhtar, and Kamran Siddique. "Resistive-RAM-Based In-Memory Computing for Neural Network: A Review." Electronics 11, no. 22 (2022): 3667. http://dx.doi.org/10.3390/electronics11223667.

Abstract:
Processing-in-memory (PIM) is a promising architecture for designing various types of neural network accelerators, as it ensures computational efficiency when combined with Resistive Random Access Memory (ReRAM). ReRAM has become a promising solution for enhancing computing efficiency due to its crossbar structure. In this paper, ReRAM-based PIM neural network accelerators are addressed, and the methods and designs of various schemes are discussed. The models and architectures implemented for neural network accelerators are examined to identify research trends. Further, the limitations and challenges of ReRAM in neural networks are also addressed in this review.
15

Ge, Fen, Ning Wu, Hao Xiao, Yuanyuan Zhang, and Fang Zhou. "Compact Convolutional Neural Network Accelerator for IoT Endpoint SoC." Electronics 8, no. 5 (2019): 497. http://dx.doi.org/10.3390/electronics8050497.

Abstract:
As a classical artificial intelligence algorithm, the convolutional neural network (CNN) algorithm plays an important role in image recognition and classification and is gradually being applied in Internet of Things (IoT) systems. A compact CNN accelerator for the IoT endpoint System-on-Chip (SoC) is proposed in this paper to meet the needs of CNN computations. Based on an analysis of the CNN structure, basic functional modules of the CNN, such as the convolution circuit and pooling circuit, are designed with a low data bandwidth and a smaller area, and an accelerator is constructed in the form of four acceleration chains. After the acceleration unit design is completed, a Cortex-M3 is used to construct a verification SoC, and the designed verification platform is implemented on an FPGA to evaluate the resource consumption and performance of the CNN accelerator. The CNN accelerator achieved a throughput of 6.54 GOPS (giga operations per second) while consuming 4901 LUTs without using any hardware multipliers. The comparison shows that the compact accelerator proposed in this paper gives the Cortex-M3-based SoC twice the CNN computational power of a quad-core Cortex-A7 SoC and 67% of the computational power of an eight-core Cortex-A53 SoC.
16

Wei, Rongshan, Chenjia Li, Chuandong Chen, Guangyu Sun, and Minghua He. "Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller." Electronics 10, no. 4 (2021): 438. http://dx.doi.org/10.3390/electronics10040438.

Abstract:
Specialized accelerator architectures have achieved great success and are a trend in computer architecture development. However, as the memory access pattern of an accelerator is relatively complicated, its memory access performance is relatively poor, limiting the overall performance improvement of hardware accelerators. Moreover, memory controllers for hardware accelerators have been scarcely researched. We consider that a dedicated accelerator memory controller is essential for improving memory access performance. To this end, we propose a dynamic random access memory (DRAM) controller called NNAMC for neural network accelerators, which monitors the memory access stream of an accelerator and directs it to the bank with the optimal address mapping scheme based on the memory access characteristics. NNAMC includes a stream access prediction unit (SAPU) that analyzes, in hardware, the type of data stream accessed by the accelerator, and designs the address mapping for different banks using a bank partitioning model (BPM). The image mapping method and hardware architecture were analyzed in a practical neural network accelerator. In the experiments, NNAMC achieved significantly lower access latency for the hardware accelerator than competing address mapping schemes, increased the row buffer hit ratio by 13.68% on average (up to 26.17%), reduced the system access latency by 26.3% on average (up to 37.68%), and lowered the hardware cost. In addition, we also confirmed that NNAMC adapted efficiently to different network parameters.
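As a simplified, hypothetical illustration of what a bank-aware address mapping scheme does (the bit-field widths below are invented, not NNAMC's actual mapping), the following sketch splits a physical address into column, bank and row fields and measures the row-buffer hit ratio of a sequential access stream:

    # Hypothetical DRAM address mapping: | row | bank | column | (illustrative widths).
    COL_BITS, BANK_BITS, ROW_BITS = 10, 3, 14

    def map_address(addr):
        col = addr & ((1 << COL_BITS) - 1)
        bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
        row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
        return row, bank, col

    def row_buffer_hit_ratio(addresses):
        # Count accesses that reuse the currently open row of each bank.
        open_row, hits = {}, 0
        for a in addresses:
            row, bank, _ = map_address(a)
            if open_row.get(bank) == row:
                hits += 1
            open_row[bank] = row
        return hits / len(addresses)

    # A streaming pattern: consecutive addresses mostly stay in one open row per bank.
    stream = list(range(0, 1 << 15, 4))
    print(f"hit ratio for a sequential stream: {row_buffer_hit_ratio(stream):.2f}")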
17

Clements, Joseph, and Yingjie Lao. "DeepHardMark: Towards Watermarking Neural Network Hardware." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 4 (2022): 4450–58. http://dx.doi.org/10.1609/aaai.v36i4.20367.

Abstract:
This paper presents a framework for embedding watermarks into DNN hardware accelerators. Unlike previous works that have looked at protecting the algorithmic intellectual properties of deep learning systems, this work proposes a methodology for defending deep learning hardware. Our methodology embeds modifications into the hardware accelerator's functional blocks that can be revealed with the rightful owner's key DNN and corresponding key sample, verifying the legitimate owner. We propose an Lp-box ADMM-based algorithm to co-optimize the watermark's hardware overhead and its impact on the design's algorithmic functionality. We evaluate the performance of the hardware watermarking scheme on popular image classifier models using various accelerator designs. Our results demonstrate that the proposed methodology effectively embeds watermarks while preserving the original functionality of the hardware architecture. Specifically, we can successfully embed watermarks into the deep learning hardware and reliably execute a ResNet ImageNet classifier with an accuracy degradation of only 0.009%.
18

Xia, Chengpeng, Yawen Chen, Haibo Zhang, and Jigang Wu. "STADIA: Photonic Stochastic Gradient Descent for Neural Network Accelerators." ACM Transactions on Embedded Computing Systems 22, no. 5s (2023): 1–23. http://dx.doi.org/10.1145/3607920.

Abstract:
Deep Neural Networks (DNNs) have demonstrated great success in many fields such as image recognition and text analysis. However, the ever-increasing sizes of both DNN models and training datasets make deep learning extremely computation- and memory-intensive. Recently, photonic computing has emerged as a promising technology for accelerating DNNs. While the design of photonic accelerators for DNN inference and forward propagation of DNN training has been widely investigated, the architectural acceleration of the equally important backpropagation of DNN training has not been well studied. In this paper, we propose a novel silicon photonic-based backpropagation accelerator for high performance DNN training. Specifically, a general-purpose photonic gradient descent unit named STADIA is designed to implement the multiplication, accumulation, and subtraction operations required for computing gradients using mature optical devices including the Mach-Zehnder Interferometer (MZI) and Microring Resonator (MRR), which can significantly reduce the training latency and improve the energy efficiency of backpropagation. To demonstrate efficient parallel computing, we propose a STADIA-based backpropagation acceleration architecture and design a dataflow using wavelength-division multiplexing (WDM). We analyze the precision of STADIA by quantifying the precision limitations imposed by losses and noises. Furthermore, we evaluate STADIA with different element sizes by analyzing the power, area and time delay for photonic accelerators based on DNN models such as AlexNet, VGG19 and ResNet. Simulation results show that the proposed architecture STADIA can achieve significant improvements of 9.7× in time efficiency and 147.2× in energy efficiency, compared with the most advanced optical-memristor based backpropagation accelerator.
19

Li, Yihang. "Sparse-Aware Deep Learning Accelerator." Highlights in Science, Engineering and Technology 39 (April 1, 2023): 305–10. http://dx.doi.org/10.54097/hset.v39i.6544.

Abstract:
Because convolutional neural network computation is difficult to implement in hardware, most previous convolutional neural network accelerator designs focused on solving the bottlenecks of computational performance and bandwidth, ignoring the importance of sparsity for accelerator design. In recent years, a few convolutional neural network accelerators have been able to take advantage of sparsity, but they usually struggle to balance computational flexibility, parallel efficiency and resource overhead. Meanwhile, the application of convolutional neural networks (CNNs) on embedded devices is limited by real-time requirements, and there is a large degree of sparsity in CNN convolution calculations. This paper summarizes sparsification methods at the algorithm level and at the FPGA level, introduces the different methods of sparsification and the research on them at different application layers, and analyzes and summarizes the advantages and development trends of sparsification.
20

Xie, Xiaoru, Mingyu Zhu, Siyuan Lu, and Zhongfeng Wang. "Efficient Layer-Wise N:M Sparse CNN Accelerator with Flexible SPEC: Sparse Processing Element Clusters." Micromachines 14, no. 3 (2023): 528. http://dx.doi.org/10.3390/mi14030528.

Abstract:
Recently, the layer-wise N:M fine-grained sparse neural network algorithm (i.e., every M weights contain N non-zero values) has attracted tremendous attention, as it can effectively reduce the computational complexity with negligible accuracy loss. However, the speed-up potential of this algorithm will not be fully exploited if the right hardware support is lacking. In this work, we design an efficient accelerator for N:M sparse convolutional neural networks (CNNs) with layer-wise sparse patterns. First, we analyze the performances of different processing element (PE) structures and extensions to construct the flexible PE architecture. Second, the variable sparse convolutional dimensions and sparse ratios are involved in the hardware design. With a sparse PE cluster (SPEC) design, the hardware can efficiently accelerate CNNs with the layer-wise N:M pattern. Finally, we employ the proposed SPEC in the CNN accelerator with a flexible network-on-chip and a specially designed dataflow. We implement hardware accelerators on the Xilinx ZCU102 FPGA and Xilinx VCU118 FPGA and evaluate them with classical CNNs such as AlexNet, VGG-16, and ResNet-50. Compared with existing accelerators designed for structured and unstructured pruned networks, our design achieves the best performance in terms of power efficiency.
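To clarify what an N:M fine-grained pattern means in practice, the following generic sketch prunes a weight matrix to the common 2:4 pattern, keeping the two largest-magnitude weights in every group of four (an algorithm-level illustration, not the SPEC hardware):

    import numpy as np

    def prune_n_of_m(w, n=2, m=4):
        # Keep the n largest-magnitude values in every group of m along the last axis.
        assert w.shape[-1] % m == 0
        groups = w.reshape(-1, m)
        mask = np.zeros_like(groups, dtype=bool)
        keep = np.argsort(-np.abs(groups), axis=1)[:, :n]   # indices of the n largest
        np.put_along_axis(mask, keep, True, axis=1)
        return (groups * mask).reshape(w.shape)

    rng = np.random.default_rng(4)
    w = rng.normal(size=(8, 16))
    w_sparse = prune_n_of_m(w, n=2, m=4)
    print("non-zeros per group of 4:",
          np.count_nonzero(w_sparse.reshape(-1, 4), axis=1))  # always 2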
21

Hu, Jian, Xianlong Zhang, and Xiaohua Shi. "Simulating Neural Network Processors." Wireless Communications and Mobile Computing 2022 (February 23, 2022): 1–12. http://dx.doi.org/10.1155/2022/7500195.

Abstract:
Deep learning has achieved results competitive with human beings in many fields. Traditionally, deep learning networks are executed on CPUs and GPUs. In recent years, more and more neural network accelerators have been introduced in both academia and industry to improve the performance and energy efficiency of deep learning networks. In this paper, we introduce a flexible and configurable functional NN accelerator simulator, which can be configured to simulate the micro-architectures of different NN accelerators. The extensible and configurable simulator is helpful for system-level micro-architecture exploration, as well as the development of operator optimization algorithms. It is a functional simulator that models the latencies of calculation and memory access and the concurrency between modules, and it reports the number of program execution cycles after the simulation is completed. We also integrated the simulator into the TVM compilation stack as an optional backend. Users can use TVM to write operators and execute them on the simulator.
22

Lim, Se-Min, and Sang-Woo Jun. "MobileNets Can Be Lossily Compressed: Neural Network Compression for Embedded Accelerators." Electronics 11, no. 6 (2022): 858. http://dx.doi.org/10.3390/electronics11060858.

Abstract:
Although neural network quantization is an imperative technology for the computation and memory efficiency of embedded neural network accelerators, simple post-training quantization incurs unacceptable levels of accuracy degradation on some important models targeting embedded systems, such as MobileNets. While explicit quantization-aware training or re-training after quantization can often reclaim lost accuracy, this is not always possible or convenient. We present an alternative approach to compressing such difficult neural networks, using a novel variant of the ZFP lossy floating-point compression algorithm to compress both model weights and inter-layer activations and demonstrate that it can be efficiently implemented on an embedded FPGA platform. Our ZFP variant, which we call ZFPe, is designed for efficient implementation on embedded accelerators, such as FPGAs, requiring a fraction of chip resources per bandwidth compared to state-of-the-art lossy compression accelerators. ZFPe-compressing the MobileNet V2 model with an 8-bit budget per weight and activation results in significantly higher accuracy compared to 8-bit integer post-training quantization and shows no loss of accuracy, compared to an uncompressed model when given a 12-bit budget per floating-point value. To demonstrate the benefits of our approach, we implement an embedded neural network accelerator on a realistic embedded acceleration platform equipped with the low-power Lattice ECP5-85F FPGA and a 32 MB SDRAM chip. Each ZFPe module consumes less than 6% of LUTs while compressing or decompressing one value per cycle, requiring a fraction of the resources compared to state-of-the-art compression accelerators while completely removing the memory bottleneck of our accelerator.
23

Yang, Zhi. "Dynamic Logo Design System of Network Media Art Based on Convolutional Neural Network." Mobile Information Systems 2022 (May 31, 2022): 1–10. http://dx.doi.org/10.1155/2022/3247229.

Abstract:
Nowadays, we are in an era of rapid development of Internet technology and unlimited expansion of information dissemination. While the application of new media and digital multimedia has become more popular, it has also brought earth-shaking changes to our lives. In order to address the problem that traditional static visual images can no longer meet people's needs, a network media art dynamic logo design system based on a convolutional neural network is proposed. Firstly, the software and hardware platform related to accelerator development is introduced, the high-level-synthesis design and computation IP core is chosen as the FPGA hardware accelerator, and the design objectives and requirements of the accelerator system are analyzed. The overall architecture of the accelerator system is designed; 76% of designers believe that a dynamic logo promotes the corporate image. Then, the function and architecture of the IP core are designed based on high-level synthesis, the code structure is standardized, the functions are partitioned, and the operation acceleration is further optimized using the HLS directives. Finally, the design is integrated with the Vivado HLS and Vivado IDE software. The experiments show that the accelerator system has low power consumption and high resource utilization.
24

Afifi, Salma, Febin Sunny, Amin Shafiee, Mahdi Nikdast, and Sudeep Pasricha. "GHOST: A Graph Neural Network Accelerator using Silicon Photonics." ACM Transactions on Embedded Computing Systems 22, no. 5s (2023): 1–25. http://dx.doi.org/10.1145/3609097.

Abstract:
Graph neural networks (GNNs) have emerged as a powerful approach for modelling and learning from graph-structured data. Multiple fields have since benefitted enormously from the capabilities of GNNs, such as recommendation systems, social network analysis, drug discovery, and robotics. However, accelerating and efficiently processing GNNs require a unique approach that goes beyond conventional artificial neural network accelerators, due to the substantial computational and memory requirements of GNNs. The slowdown of scaling in CMOS platforms also motivates a search for alternative implementation substrates. In this paper, we present GHOST, the first silicon-photonic hardware accelerator for GNNs. GHOST efficiently alleviates the costs associated with both vertex-centric and edge-centric operations. It implements separately the three main stages involved in running GNNs in the optical domain, allowing it to be used for the inference of various widely used GNN models and architectures, such as graph convolution networks and graph attention networks. Our simulation studies indicate that GHOST exhibits at least 10.2× better throughput and 3.8× better energy efficiency when compared to GPU, TPU, CPU and multiple state-of-the-art GNN hardware accelerators.
25

Liang, Yong, Junwen Tan, Zhisong Xie, Zetao Chen, Daoqian Lin, and Zhenhao Yang. "Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence." Sensors 24, no. 1 (2023): 240. http://dx.doi.org/10.3390/s24010240.

Abstract:
In recent years, edge intelligence (EI) has emerged, combining edge computing with AI, and specifically deep learning, to run AI algorithms directly on edge devices. In practical applications, EI faces challenges related to computational power, power consumption, size, and cost, with the primary challenge being the trade-off between computational power and power consumption. This has rendered traditional computing platforms unsustainable, making heterogeneous parallel computing platforms a crucial pathway for implementing EI. In our research, we leveraged the Xilinx Zynq 7000 heterogeneous computing platform, employed high-level synthesis (HLS) for design, and implemented two different accelerators for LeNet-5 using loop unrolling and pipelining optimization techniques. The experimental results show that when running at a clock speed of 100 MHz, the PIPELINE accelerator, compared to the UNROLL accelerator, experiences an 8.09% increase in power consumption but speeds up by 14.972 times, making the PIPELINE accelerator superior in performance. Compared to the CPU, the PIPELINE accelerator reduces power consumption by 91.37% and speeds up by 70.387 times, while compared to the GPU, it reduces power consumption by 93.35%. This study provides two different optimization schemes for edge intelligence applications through design and experimentation and demonstrates the impact of different quantization methods on FPGA resource consumption. These experimental results can provide a reference for practical applications, thereby providing a reference hardware acceleration scheme for edge intelligence applications.
26

Hosseini, Morteza, and Tinoosh Mohsenin. "Binary Precision Neural Network Manycore Accelerator." ACM Journal on Emerging Technologies in Computing Systems 17, no. 2 (2021): 1–27. http://dx.doi.org/10.1145/3423136.

Abstract:
This article presents a low-power, programmable, domain-specific manycore accelerator, Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary precision weight/activation neural network models. Such networks have compact models in which weights are constrained to only 1 bit and can be packed several in one memory entry that minimizes memory footprint to its finest. Packing weights also facilitates executing single instruction, multiple data with simple circuitry that allows maximizing performance and efficiency. The proposed BiNMAC has light-weight cores that support domain-specific instructions, and a router-based memory access architecture that helps with efficient implementation of layers in binary precision weight/activation neural networks of proper size. With only 3.73% and 1.98% area and average power overhead, respectively, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the instruction set architecture of the BiNMAC, each of which replaces execution cycles of frequently used functions with 1 clock cycle that otherwise would have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory on a bit-level basis, that expedites reshaping intermediate data to be well-aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm² with an average power of 232 mW at 1-GHz clock frequency and 1.1 V. The 64-cluster architecture takes 36.5 mm² area and, if fully exploited, consumes a total power of 16.4 W and can perform 1,360 Giga Operations Per Second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies including ResNet-20 and LeNet-5 for high-performance image classification, as well as a ConvNet and a multilayer perceptron for low-power physiological applications were implemented on BiNMAC. The implementation results indicate that the population-count instruction alone can expedite the performance by approximately 5×. When other new instructions are added to a RISC machine with existing population-count instruction, the performance is increased by 58% on average. To compare the performance of the BiNMAC with other commercial-off-the-shelf platforms, the case studies with their double-precision floating-point models are also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ∼2.1%–9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. On low power settings and within a margin of ∼3.7%–5.5% accuracy loss compared to the ARM Cortex-A57 CPU implementation, BiNMAC is roughly ∼9.7×–17.2× (or 38.8×–68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
27

Park, Sang-Soo, and Ki-Seok Chung. "CENNA: Cost-Effective Neural Network Accelerator." Electronics 9, no. 1 (2020): 134. http://dx.doi.org/10.3390/electronics9010134.

Abstract:
Convolutional neural networks (CNNs) are widely adopted in various applications. State-of-the-art CNN models deliver excellent classification performance, but they require a large amount of computation and data exchange because they typically employ many processing layers. Among these processing layers, convolution layers, which carry out many multiplications and additions, account for a major portion of computation and memory access. Therefore, reducing the amount of computation and memory access is the key for high-performance CNNs. In this study, we propose a cost-effective neural network accelerator, named CENNA, whose hardware cost is reduced by employing a cost-centric matrix multiplication that employs both Strassen’s multiplication and a naïve multiplication. Furthermore, the convolution method using the proposed matrix multiplication can minimize data movement by reusing both the feature map and the convolution kernel without any additional control logic. In terms of throughput, power consumption, and silicon area, the efficiency of CENNA is up to 88 times higher than that of conventional designs for the CNN inference.
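For reference, Strassen's method computes a 2×2 block product with 7 block multiplications instead of 8; the textbook formulation below illustrates the arithmetic that CENNA's cost-centric multiplier combines with naïve multiplication (it is not the paper's datapath):

    import numpy as np

    def strassen_2x2_blocks(A, B):
        # One level of Strassen's algorithm on 2x2 blocks: 7 block multiplies instead of 8.
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)
        C11 = M1 + M4 - M5 + M7
        C12 = M3 + M5
        C21 = M2 + M4
        C22 = M1 - M2 + M3 + M6
        return np.block([[C11, C12], [C21, C22]])

    A = np.random.default_rng(5).normal(size=(4, 4))
    B = np.random.default_rng(6).normal(size=(4, 4))
    assert np.allclose(strassen_2x2_blocks(A, B), A @ B)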
28

Kim, Dongyoung, Junwhan Ahn, and Sungjoo Yoo. "ZeNA: Zero-Aware Neural Network Accelerator." IEEE Design & Test 35, no. 1 (2018): 39–46. http://dx.doi.org/10.1109/mdat.2017.2741463.

29

Chen, Tianshi, Zidong Du, Ninghui Sun, et al. "A High-Throughput Neural Network Accelerator." IEEE Micro 35, no. 3 (2015): 24–32. http://dx.doi.org/10.1109/mm.2015.41.

30

To, Chun-Hao, Eduardo Rozo, Elisabeth Krause, Hao-Yi Wu, Risa H. Wechsler, and Andrés N. Salcedo. "LINNA: Likelihood Inference Neural Network Accelerator." Journal of Cosmology and Astroparticle Physics 2023, no. 01 (2023): 016. http://dx.doi.org/10.1088/1475-7516/2023/01/016.

Abstract:
Bayesian posterior inference of modern multi-probe cosmological analyses incurs massive computational costs. For instance, depending on the combinations of probes, a single posterior inference for the Dark Energy Survey (DES) data had a wall-clock time that ranged from 1 to 21 days using a state-of-the-art computing cluster with 100 cores. These computational costs have severe environmental impacts and the long wall-clock time slows scientific productivity. To address these difficulties, we introduce LINNA: the Likelihood Inference Neural Network Accelerator. Relative to the baseline DES analyses, LINNA reduces the computational cost associated with posterior inference by a factor of 8–50. If applied to the first-year cosmological analysis of Rubin Observatory's Legacy Survey of Space and Time (LSST Y1), we conservatively estimate that LINNA will save more than U.S. $300,000 on energy costs, while simultaneously reducing CO2 emission by 2,400 tons. To accomplish these reductions, LINNA automatically builds training data sets, creates neural network emulators, and produces a Markov chain that samples the posterior. We explicitly verify that LINNA accurately reproduces the first-year DES (DES Y1) cosmological constraints derived from a variety of different data vectors with our default code settings, without needing to retune the algorithm every time. Further, we find that LINNA is sufficient for enabling accurate and efficient sampling for LSST Y10 multi-probe analyses. We make LINNA publicly available at https://github.com/chto/linna, to enable others to perform fast and accurate posterior inference in contemporary cosmological analyses.
31

Ro, Yuhwan, Eojin Lee, and Jung Ahn. "Evaluating the Impact of Optical Interconnects on a Multi-Chip Machine-Learning Architecture." Electronics 7, no. 8 (2018): 130. http://dx.doi.org/10.3390/electronics7080130.

Abstract:
Following trends that emphasize neural networks for machine learning, many studies regarding computing systems have focused on accelerating deep neural networks. These studies often propose utilizing the accelerator specialized in a neural network and the cluster architecture composed of interconnected accelerator chips. We observed that inter-accelerator communication within a cluster has a significant impact on the training time of the neural network. In this paper, we show the advantages of optical interconnects for multi-chip machine-learning architecture by demonstrating performance improvements through replacing electrical interconnects with optical ones in an existing multi-chip system. We propose to use highly practical optical interconnect implementation and devise an arithmetic performance model to fairly assess the impact of optical interconnects on a machine-learning accelerator platform. In our evaluation of nine Convolutional Neural Networks with various input sizes, 100 and 400 Gbps optical interconnects reduce the training time by an average of 20.6% and 35.6%, respectively, compared to the baseline system with 25.6 Gbps electrical ones.
32

Liu, Yang, Yiheng Zhang, Xiaoran Hao, et al. "Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering." Electronics 13, no. 5 (2024): 975. http://dx.doi.org/10.3390/electronics13050975.

Abstract:
Convolutional neural networks have been widely applied in the field of computer vision. In convolutional neural networks, convolution operations account for more than 90% of the total computational workload. The current mainstream approach to achieving high energy-efficient convolution operations is through dedicated hardware accelerators. Convolution operations involve a significant amount of weights and input feature data. Due to limited on-chip cache space in accelerators, there is a significant amount of off-chip DRAM memory access involved in the computation process. The latency of DRAM access is 20 times higher than that of SRAM, and the energy consumption of DRAM access is 100 times higher than that of multiply–accumulate (MAC) units. It is evident that the “memory wall” and “power wall” issues in neural network computation remain challenging. This paper presents the design of a hardware accelerator for convolutional neural networks. It employs a dataflow optimization strategy based on on-chip data reordering. This strategy improves on-chip data utilization and reduces the frequency of data exchanges between on-chip cache and off-chip DRAM. The experimental results indicate that compared to the accelerator without this strategy, it can reduce data exchange frequency by up to 82.9%.
33

Chen, Zhimei. "Hardware Accelerated Optimization of Deep Learning Model on Artificial Intelligence Chip." Frontiers in Computing and Intelligent Systems 6, no. 2 (2023): 11–14. http://dx.doi.org/10.54097/fcis.v6i2.03.

Abstract:
With the rapid development of deep learning technology, the demand for computing resources is increasing, and the accelerated optimization of hardware on artificial intelligence (AI) chip has become one of the key ways to solve this challenge. This paper aims to explore the hardware acceleration optimization strategy of deep learning model on AI chip to improve the training and inference performance of the model. In this paper, the method and practice of optimizing deep learning model on AI chip are deeply analyzed by comprehensively considering the hardware characteristics such as parallel processing ability, energy-efficient computing, neural network accelerator, flexibility and programmability, high integration and heterogeneous computing structure. By designing and implementing an efficient convolution accelerator, the computational efficiency of the model is improved. The introduction of energy-efficient computing effectively reduces energy consumption, which provides feasibility for the practical application of mobile devices and embedded systems. At the same time, the optimization design of neural network accelerator becomes the core of hardware acceleration, and deep learning calculation such as convolution and matrix operation are accelerated through special hardware structure, which provides strong support for the real-time performance of the model. By analyzing the actual application cases of hardware accelerated optimization in different application scenarios, this paper highlights the key role of hardware accelerated optimization in improving the performance of deep learning model. Hardware accelerated optimization not only improves the computing efficiency, but also provides efficient and intelligent computing support for AI applications in different fields.
34

Neelam, Srikanth, and A. Amalin Prince. "VCONV: A Convolutional Neural Network Accelerator for FPGAs." Electronics 14, no. 4 (2025): 657. https://doi.org/10.3390/electronics14040657.

Abstract:
Field Programmable Gate Arrays (FPGAs), with their wide portfolio of configurable resources such as Look-Up Tables (LUTs), Block Random Access Memory (BRAM), and Digital Signal Processing (DSP) blocks, are the best option for custom hardware designs. Their low power consumption and cost-effectiveness give them an advantage over Graphics Processing Units (GPUs) and Central Processing Units (CPUs) in providing efficient accelerator solutions for compute-intensive Convolutional Neural Network (CNN) models. CNN accelerators are dedicated hardware modules capable of performing compute operations such as convolution, activation, normalization, and pooling with minimal intervention from a host. Designing accelerators for deeper CNN models requires FPGAs with vast resources, which impact its advantages in terms of power and price. In this paper, we propose the VCONV Intellectual Property (IP), an efficient and scalable CNN accelerator architecture for applications where power and cost are constraints. VCONV, with its configurable design, can be deployed across multiple smaller FPGAs instead of a single large FPGA to provide better control over cost and parallel processing. VCONV can be deployed across heterogeneous FPGAs, depending on the performance requirements of each layer. The IP’s performance can be evaluated using embedded monitors to ensure that the accelerator is configured to achieve the best performance. VCONV can be configured for data type format, convolution engine (CE) and convolution unit (CU) configurations, as well as the sequence of operations based on the CNN model and layer. VCONV can be interfaced through the Advanced Peripheral Bus (APB) for configuration and the Advanced eXtensible Interface (AXI) stream for data transfers. The IP was implemented and validated on the Avnet Zedboard and tested on the first layer of AlexNet, VGG16, and ResNet18 with multiple CE configurations, demonstrating 100% performance from MAC units with no idle time. We also synthesized multiple VCONV instances required for AlexNet, achieving the lowest BRAM utilization of just 1.64 Mb and deriving a performance of 56GOPs.
35

Brennsteiner, Stefan, Tughrul Arslan, John Thompson, and Andrew McCormick. "A Real-Time Deep Learning OFDM Receiver." ACM Transactions on Reconfigurable Technology and Systems 15, no. 3 (2022): 1–25. http://dx.doi.org/10.1145/3494049.

Abstract:
Machine learning in the physical layer of communication systems holds the potential to improve performance and simplify design methodology. Many algorithms have been proposed; however, the model complexity is often unfeasible for real-time deployment. The real-time processing capability of these systems has not been proven yet. In this work, we propose a novel, less complex, fully connected neural network to perform channel estimation and signal detection in an orthogonal frequency division multiplexing system. The memory requirement, which is often the bottleneck for fully connected neural networks, is reduced by ≈ 27 times by applying known compression techniques in a three-step training process. Extensive experiments were performed for pruning and quantizing the weights of the neural network detector. Additionally, Huffman encoding was used on the weights to further reduce memory requirements. Based on this approach, we propose the first field-programmable gate array based, real-time capable neural network accelerator, specifically designed to accelerate the orthogonal frequency division multiplexing detector workload. The accelerator is synthesized for a Xilinx RFSoC field-programmable gate array, uses small-batch processing to increase throughput, efficiently supports branching neural networks, and implements superscalar Huffman decoders.
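As a rough, generic sketch of the first two steps of such a compression pipeline, magnitude pruning followed by uniform quantization can be expressed as below (the sparsity level, bit width and weight shapes are illustrative, not the authors' three-step training process); Huffman coding would then operate on the resulting integer symbols:

    import numpy as np

    def magnitude_prune(w, sparsity=0.8):
        # Zero out the smallest-magnitude fraction of weights.
        thresh = np.quantile(np.abs(w), sparsity)
        return np.where(np.abs(w) >= thresh, w, 0.0)

    def quantize_uniform(w, bits=8):
        # Symmetric uniform quantization to signed integers, plus the scale to dequantize.
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
        q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    rng = np.random.default_rng(7)
    w = rng.normal(scale=0.1, size=(256, 64))
    w_p = magnitude_prune(w, sparsity=0.8)
    q, scale = quantize_uniform(w_p, bits=8)
    # The small set of integer symbols in q is what a Huffman coder would then compress.
    print("non-zero fraction:", np.count_nonzero(w_p) / w_p.size,
          "max dequantization error:", np.max(np.abs(q * scale - w_p)))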
36

Cho, Mannhee, and Youngmin Kim. "FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit." Electronics 10, no. 22 (2021): 2859. http://dx.doi.org/10.3390/electronics10222859.

Abstract
Convolutional neural networks (CNNs) are widely used in modern applications for their versatility and high classification accuracy. Field-programmable gate arrays (FPGAs) are considered to be suitable platforms for CNNs based on their high performance, rapid development, and reconfigurability. Although many studies have proposed methods for implementing high-performance CNN accelerators on FPGAs using optimized data types and algorithm transformations, accelerators can be optimized further by investigating more efficient uses of FPGA resources. In this paper, we propose an FPGA-based CNN accelerator using multiple approximate accumulation units based on a fixed-point data type. We implemented the LeNet-5 CNN architecture, which performs classification of handwritten digits using the MNIST handwritten digit dataset. The proposed accelerator was implemented using a high-level synthesis tool on a Xilinx FPGA. The proposed accelerator applies an optimized fixed-point data type and loop parallelization to improve performance. Approximate operation units are implemented using FPGA logic resources instead of high-precision digital signal processing (DSP) blocks, which are inefficient for low-precision data. Our accelerator model achieves 66% less memory usage and approximately 50% reduced network latency compared to a floating-point design, and its resource utilization is optimized to use 78% fewer DSP blocks compared to general fixed-point designs.
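A minimal model of the fixed-point multiply-accumulate idea is shown below: activations and weights are converted to fixed point, products are accumulated as integers, and an optional truncation of low-order product bits stands in for the cheaper approximate multipliers built from LUTs. The Q-format and truncation amount are assumptions for illustration, not the exact formats used in the paper.

import numpy as np

FRAC_BITS = 8  # Q-style fixed point with 8 fractional bits (an assumption)

def to_fixed(x):
    return np.round(x * (1 << FRAC_BITS)).astype(np.int32)

def fixed_mac(acts, weights, truncate_bits=0):
    # Integer multiply-accumulate; optionally drop low-order bits of each
    # product to mimic a cheaper, approximate multiplier.
    products = (to_fixed(acts) * to_fixed(weights)) >> truncate_bits
    acc = products.sum()
    # The result carries 2*FRAC_BITS - truncate_bits fractional bits.
    return acc / float(1 << (2 * FRAC_BITS - truncate_bits))

a = np.random.rand(64).astype(np.float32)
w = np.random.randn(64).astype(np.float32)
print("float reference:      ", float(np.dot(a, w)))
print("exact fixed point:    ", fixed_mac(a, w))
print("approx (4 LSBs dropped):", fixed_mac(a, w, truncate_bits=4))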
37

Choubey, Abhishek, and Shruti Bhargava Choubey. "A Promising Hardware Accelerator with PAST Adder." Advances in Science and Technology 105 (April 2021): 241–48. http://dx.doi.org/10.4028/www.scientific.net/ast.105.241.

Abstract
Recent neural network research has demonstrated significant benefits in machine learning compared to conventional algorithms based on handcrafted models and features. In areas such as video, speech, and image recognition, neural networks are now widely adopted. However, the high computational and storage complexity of neural network inference poses great challenges for its application. These networks are compute-intensive algorithms that currently require dedicated hardware for execution. In this work, we point out the difficulty of multi-operand adders (MOAs) and their high resource utilization in an FPGA implementation of a CNN. To address this challenge, a parallel self-timed adder (PASTA) is implemented, which mainly aims at minimizing the number of transistors while evaluating different factors for PASTA, i.e., area, power, and delay.
38

Huang, Hongmin, Zihao Liu, Taosheng Chen, Xianghong Hu, Qiming Zhang, and Xiaoming Xiong. "Design Space Exploration for YOLO Neural Network Accelerator." Electronics 9, no. 11 (2020): 1921. http://dx.doi.org/10.3390/electronics9111921.

Abstract
The You Only Look Once (YOLO) neural network has great advantages and extensive applications in computer vision. The convolutional layers are the most important part of the neural network and take up most of the computation time. Improving the efficiency of the convolution operations can greatly increase the speed of the neural network. Field programmable gate arrays (FPGAs) have been widely used in accelerators for convolutional neural networks (CNNs) thanks to their configurability and parallel computing. This paper proposes a design space exploration for the YOLO neural network based on FPGA. A data block transmission strategy is proposed and a multiply and accumulate (MAC) design, which consists of two 14 × 14 processing element (PE) matrices, is designed. The PE matrices are configurable for different CNNs according to the given required functions. In order to take full advantage of the limited logical resources and the memory bandwidth on the given FPGA device and to simultaneously achieve the best performance, an improved roofline model is used to evaluate the hardware design to balance the computing throughput and the memory bandwidth requirement. The accelerator achieves 41.99 giga operations per second (GOPS) and consumes 7.50 W running at the frequency of 100 MHz on the Xilinx ZC706 board.
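The improved roofline model referred to here balances compute throughput against memory bandwidth. A basic (unimproved) roofline evaluation looks like the sketch below; the numbers are made up for illustration, with the peak chosen to mimic two 14 × 14 PE matrices at 100 MHz, and are not results from the paper.

def roofline_gops(total_ops, total_bytes, peak_gops, bandwidth_gbs):
    # Operational intensity in ops/byte decides whether the design point is
    # compute-bound or memory-bound.
    intensity = total_ops / total_bytes
    attainable = min(peak_gops, bandwidth_gbs * intensity)
    bound = "compute" if attainable == peak_gops else "memory"
    return attainable, intensity, bound

# Illustrative numbers only: a tiled convolution layer moving 25 MB of data
# for 0.2 GOP of work on a 2 * 14 * 14 = 392-multiplier array at 100 MHz.
gops, oi, bound = roofline_gops(total_ops=0.2e9, total_bytes=25e6,
                                peak_gops=2 * 392 * 100e6 / 1e9,
                                bandwidth_gbs=4.2)
print(f"intensity {oi:.1f} ops/byte -> {gops:.1f} GOPS ({bound}-bound)")

Design points that land on the bandwidth slope are memory-bound, so increasing on-chip data reuse (e.g., larger tiles) pays off more than adding processing elements.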
39

de Sousa, André L., Mário P. Véstias, and Horácio C. Neto. "Multi-Model Inference Accelerator for Binary Convolutional Neural Networks." Electronics 11, no. 23 (2022): 3966. http://dx.doi.org/10.3390/electronics11233966.

Abstract
Binary convolutional neural networks (BCNNs) have shown good accuracy for small to medium neural network models. Their extreme quantization of weights and activations reduces off-chip data transfer and greatly reduces the computational complexity of convolutions. Further reduction in the complexity of a BCNN model for fast execution can be achieved with model size reduction, at the cost of network accuracy. In this paper, a multi-model inference technique is proposed to reduce the execution time of the binarized inference process without accuracy reduction. The technique considers a cascade of neural network models with different computation/accuracy ratios. A parameterizable binarized neural network with different trade-offs between complexity and accuracy is used to obtain the multiple network models. We also propose a hardware accelerator to run multi-model inference in embedded systems. The multi-model inference accelerator is demonstrated on low-density Zynq-7010 and Zynq-7020 FPGA devices, classifying images from the CIFAR-10 dataset. The proposed accelerator improves the frame rate per number of LUTs by 7.2× compared to previous solutions on a Zynq-7020 FPGA with similar accuracy. This shows the effectiveness of the multi-model inference technique and the efficiency of the proposed hardware accelerator.
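The cascade idea can be captured in a few lines: run the cheapest binarized model first and fall back to a larger model only when the prediction is not confident. The confidence test and thresholds below are illustrative placeholders; the paper's selection criterion may differ.

import numpy as np

def cascade_predict(x, models, thresholds):
    # `models` is ordered cheapest-first; `thresholds` has one entry per model
    # except the last, whose prediction is always accepted.
    for model, thresh in zip(models[:-1], thresholds):
        scores = np.asarray(model(x), dtype=float)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        if probs.max() >= thresh:          # confident enough: stop early
            return int(probs.argmax())
    return int(np.asarray(models[-1](x)).argmax())

small = lambda x: np.array([0.2, 2.5, 0.1])   # stand-ins for binarized models
large = lambda x: np.array([0.1, 0.3, 3.0])
print(cascade_predict(None, [small, large], thresholds=[0.9]))

Because most inputs are resolved by the small model, the average inference time drops, while hard inputs still receive the accuracy of the full model.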
40

Du, Wenhe, Shuoyu Chen, Lei Wang, and Ruili Chai. "Design of Yolov4-Tiny convolutional neural network hardware accelerator based on FPGA." Journal of Physics: Conference Series 2849, no. 1 (2024): 012005. http://dx.doi.org/10.1088/1742-6596/2849/1/012005.

Abstract
This article presents a Yolov4-Tiny convolutional neural network hardware accelerator based on an FPGA. A four-stage pipelined convolutional array structure is proposed. In the design, NC4HW4 parameter rearrangement and the Im2col dimensionality-reduction algorithm are used as the core to maximize the parallelism of matrix operations under limited resources. Secondly, a PE convolutional computing unit structure was designed, and a resource-efficient and highly reliable convolutional computing module was implemented by combining INT8 DSP resource-reuse techniques. Finally, the accelerator was deployed on Xilinx's Zynq7030 development board. The experimental results show that, at a clock frequency of 130 MHz, the power consumption of the hardware accelerator is only 2.723 W and the performance is 59.54 Gbps, an improvement of more than 2.1 times over related research. This accelerator can complete hardware-accelerated computing tasks in object detection with high energy efficiency.
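Im2col, one of the two core techniques named above, lowers a convolution to a single matrix multiplication so that a fixed multiplier array stays busy. The NumPy sketch below shows the data rearrangement only; the NC4HW4 repacking and INT8 DSP packing from the paper are not reproduced.

import numpy as np

def im2col(x, k, stride=1):
    # x: (C, H, W) -> columns of shape (C*k*k, out_h*out_w)
    c, h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.empty((c * k * k, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            cols[:, idx] = x[:, i:i + k, j:j + k].ravel()
            idx += 1
    return cols, out_h, out_w

def conv_as_gemm(x, weights, stride=1):
    # weights: (out_c, in_c, k, k) flattened into an (out_c, in_c*k*k) matrix.
    out_c, in_c, k, _ = weights.shape
    cols, out_h, out_w = im2col(x, k, stride)
    out = weights.reshape(out_c, -1) @ cols      # one large GEMM
    return out.reshape(out_c, out_h, out_w)

x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
print(conv_as_gemm(x, w).shape)   # (4, 6, 6)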
41

Kang, Soongyu, Seongjoo Lee, and Yunho Jung. "Design of Network-on-Chip-Based Restricted Coulomb Energy Neural Network Accelerator on FPGA Device." Sensors 24, no. 6 (2024): 1891. http://dx.doi.org/10.3390/s24061891.

Abstract
Sensor applications in internet of things (IoT) systems, coupled with artificial intelligence (AI) technology, are becoming an increasingly significant part of modern life. For low-latency AI computation in IoT systems, there is a growing preference for edge-based computing over cloud-based alternatives. The restricted coulomb energy neural network (RCE-NN) is a machine learning algorithm well-suited for implementation on edge devices due to its simple learning and recognition scheme. In addition, because the RCE-NN generates neurons as needed, it is easy to adjust the network structure and learn additional data. Therefore, the RCE-NN can provide edge-based real-time processing for various sensor applications. However, previous RCE-NN accelerators have limited scalability when the number of neurons increases. In this paper, we propose a network-on-chip (NoC)-based RCE-NN accelerator and present the results of implementation on a field-programmable gate array (FPGA). NoC is an effective solution for managing massive interconnections. The proposed RCE-NN accelerator utilizes a hierarchical–star (H–star) topology, which efficiently handles a large number of neurons, along with routers specifically designed for the RCE-NN. These approaches result in only a slight decrease in the maximum operating frequency as the number of neurons increases. Consequently, the maximum operating frequency of the proposed RCE-NN accelerator with 512 neurons increased by 126.1% compared to a previous RCE-NN accelerator. This enhancement was verified with two datasets for gas and sign language recognition, achieving accelerations of up to 54.8% in learning time and up to 45.7% in recognition time. The NoC scheme of the proposed RCE-NN accelerator is an appropriate solution to ensure the scalability of the neural network while providing high-performance on-chip learning and recognition.
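For readers unfamiliar with the RCE-NN algorithm itself, the following is a minimal software model of its learn-as-needed behavior: each neuron stores a prototype, a class label, and an influence radius; a neuron is added only when no correctly labeled neuron covers the sample, and wrongly firing neurons shrink their radius. The parameters and tie-breaking are simplifications; the NoC routing and H-star topology of the accelerator are not modeled here.

import numpy as np

class RceNN:
    def __init__(self, max_radius=5.0, min_radius=0.1):
        self.protos, self.labels, self.radii = [], [], []
        self.max_radius, self.min_radius = max_radius, min_radius

    def learn(self, x, label):
        fired_correct = False
        for i, (p, l) in enumerate(zip(self.protos, self.labels)):
            d = float(np.linalg.norm(x - p))
            if d <= self.radii[i]:
                if l == label:
                    fired_correct = True
                else:
                    # Shrink the wrongly firing neuron's influence field.
                    self.radii[i] = max(d - 1e-6, self.min_radius)
        if not fired_correct:
            # Allocate a new neuron centered on the sample.
            self.protos.append(np.asarray(x, dtype=float))
            self.labels.append(label)
            self.radii.append(self.max_radius)

    def classify(self, x):
        firing = [l for p, l, r in zip(self.protos, self.labels, self.radii)
                  if np.linalg.norm(x - p) <= r]
        return max(set(firing), key=firing.count) if firing else None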
42

Wang, Yuejiao, Zhong Ma, and Zunming Yang. "Sequential Characteristics Based Operators Disassembly Quantization Method for LSTM Layers." Applied Sciences 12, no. 24 (2022): 12744. http://dx.doi.org/10.3390/app122412744.

Abstract
Embedded computing platforms such as neural network accelerators deploying neural network models need to quantize the values into low-bit integers through quantization operations. However, most current embedded computing platforms with a fixed-point architecture do not directly support performing the quantization operation for the LSTM layer. Meanwhile, the influence of sequential input data for LSTM has not been taken into account by quantization algorithms. Aiming at these two technical bottlenecks, a new sequential-characteristics-based operators disassembly quantization method for LSTM layers is proposed. Specifically, the calculation process of the LSTM layer is split into multiple regular layers supported by the neural network accelerator. The quantization-parameter-generation process is designed as a sequential-characteristics-based combination strategy for sequential and diverse image groups. Therefore, LSTM is converted into multiple mature operators for single-layer quantization and deployed on the neural network accelerator. Comparison experiments with the state of the art show that the proposed quantization method has comparable or even better performance than the full-precision baseline in the field of character-/word-level language prediction and image classification applications. The proposed method has strong application potential in the subsequent addition of novel operators for future neural network accelerators.
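The disassembly idea is easiest to see in code: one LSTM time step is nothing more than two matrix multiplications, a bias add, four activations, and a handful of elementwise operations, each of which a fixed-point accelerator typically already supports as a "regular" layer. The NumPy sketch below shows the decomposition in floating point; the paper's per-operator quantization-parameter strategy is not reproduced.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_disassembled(x, h, c, W, U, b):
    # One LSTM time step expressed as accelerator-friendly primitives:
    # two GEMMs, a bias add, four activations, and elementwise multiplies/adds.
    gates = x @ W + h @ U + b                      # GEMM + GEMM + add, shape (4*H,)
    i, f, g, o = np.split(gates, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # activation "layers"
    g = np.tanh(g)
    c_new = f * c + i * g                          # elementwise ops
    h_new = o * np.tanh(c_new)
    return h_new, c_new

H, X = 16, 8
x, h, c = (np.random.randn(n) for n in (X, H, H))
W, U, b = np.random.randn(X, 4 * H), np.random.randn(H, 4 * H), np.zeros(4 * H)
h, c = lstm_step_disassembled(x, h, c, W, U, b)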
43

Kumar, Pramod. "Review of Advanced Methods in Hardware Acceleration for Deep Neural Networks." International Journal for Research in Applied Science and Engineering Technology 12, no. 5 (2024): 4523–29. http://dx.doi.org/10.22214/ijraset.2024.62595.

Abstract
Convolutional neural networks have become very efficient at performing tasks like object detection, providing human-like accuracy. However, their practical implementation needs significant hardware resources and memory bandwidth. In the recent past, a great deal of research has been carried out on achieving higher efficiency in implementing such neural networks in hardware. We focus on FPGAs for hardware implementation due to their flexibility for customisation to such neural network architectures. In this paper we discuss the metrics for an efficient hardware accelerator and the general methods available for achieving an efficient design. Further, we discuss the actual methods used in recent research for the implementation of deep neural networks, particularly for object detection related applications. These methods range from actual ASIC designs like TPUs [1] for on-chip acceleration and state-of-the-art open-source designs like Gemini, to methods like hardware reuse, reconfigurable nodes, and approximation in computations as a trade-off between speed and accuracy. This paper will be a valuable summary for researchers starting in the field of hardware accelerator design for neural networks.
44

Kim, Jeonghun, and Sunggu Lee. "Fast Design Space Exploration for Always-On Neural Networks." Electronics 13, no. 24 (2024): 4971. https://doi.org/10.3390/electronics13244971.

Abstract
An analytical model can quickly predict performance and energy efficiency based on information about the neural network model and neural accelerator architecture, making it ideal for rapid pre-synthesis design space exploration. This paper proposes a new analytical model specifically targeted for convolutional neural networks used in always-on applications. To validate the proposed model, the performance and energy efficiency estimated by the model were compared with actual hardware and post-synthesis gate-level simulations of hardware synthesized with a state-of-the-art electronic design automation (EDA) synthesis tool. Comparisons with hardware created for the Eyeriss neural accelerator showed average execution time and energy consumption error rates of 3.33% and 13.54%, respectively. Comparisons with hardware synthesis results showed an error of 3.18% to 9.44% for two example neural accelerator configurations used to execute MobileNet, EfficientNet, and DarkNet neural network models. Finally, the utility of the proposed model was demonstrated by using it to evaluate the effects of different channel sizes, pruning rates, and batch sizes in several neural network designs for always-on vision, text, and audio processing.
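An analytical model of this kind boils down to closed-form expressions for latency and energy. The sketch below shows the general flavor, deriving cycles from the MAC count and PE utilization and energy from per-MAC and per-byte costs; all constants are illustrative assumptions, not the calibrated values from the paper.

def predict_layer(macs, weight_bytes, act_bytes, num_pes, utilization,
                  freq_hz, e_mac_pj=0.5, e_dram_pj_per_byte=100.0):
    # Latency: ideal MAC cycles divided by how well the dataflow keeps PEs busy.
    cycles = macs / (num_pes * utilization)
    latency_s = cycles / freq_hz
    # Energy: compute energy plus off-chip traffic energy (in picojoules).
    energy_j = (macs * e_mac_pj +
                (weight_bytes + act_bytes) * e_dram_pj_per_byte) * 1e-12
    return latency_s, energy_j

# Illustrative: a 100-MMAC convolution layer on a 168-PE, Eyeriss-like array.
lat, en = predict_layer(macs=100e6, weight_bytes=300e3, act_bytes=800e3,
                        num_pes=168, utilization=0.75, freq_hz=200e6)
print(f"{lat * 1e3:.2f} ms, {en * 1e3:.2f} mJ")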
45

Cosatto, E., and H. P. Graf. "A neural network accelerator for image analysis." IEEE Micro 15, no. 3 (1995): 32–38. http://dx.doi.org/10.1109/40.387680.

46

Kuznar, Damian, Robert Szczygiel, Piotr Maj, and Anna Kozioł. "Design of artificial neural network hardware accelerator." Journal of Instrumentation 18, no. 04 (2023): C04013. http://dx.doi.org/10.1088/1748-0221/18/04/c04013.

Abstract
We present the design of a scalable processor providing artificial neural network (ANN) functionality, together with in-house developed tools for automatic conversion of an ANN model designed with the TensorFlow library into HDL code. The hardware is described in SystemVerilog, and the synthesized processor module can perform neural network calculations at clock speeds exceeding 100 MHz. Our in-house software tool for ANN conversion supports translation of an arbitrary multilayer perceptron neural network into a state machine module, which performs the necessary calculations. It is also dynamically reconfigurable, so the ANN operating on the hardware can be changed after it is deployed as an ASIC. The project targets an in-pixel implementation for X-ray photon energy estimation. The energy estimation is to be delivered with an accuracy that exceeds that of the ADC converter which feeds the ANN with data.
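The extraction half of such a TensorFlow-to-HDL flow can be sketched as follows: walk a trained Keras multilayer perceptron, pull out each Dense layer's kernel and bias, and convert them to signed fixed-point integers that a weight-ROM generator could consume. The fixed-point format and the function name are assumptions for illustration, and the actual HDL emission step is omitted.

import numpy as np
from tensorflow import keras

def export_dense_weights_fixed_point(model, frac_bits=12):
    # Convert each Dense layer's kernel and bias to signed fixed-point
    # integers suitable for an HDL weight ROM (hypothetical downstream tool).
    exported = []
    for layer in model.layers:
        if not isinstance(layer, keras.layers.Dense):
            continue
        kernel, bias = layer.get_weights()
        scale = 1 << frac_bits
        exported.append({
            "name": layer.name,
            "kernel_q": np.round(kernel * scale).astype(np.int32),
            "bias_q": np.round(bias * scale).astype(np.int32),
            "frac_bits": frac_bits,
        })
    return exported

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])
for entry in export_dense_weights_fixed_point(model):
    print(entry["name"], entry["kernel_q"].shape, "Q frac bits:", entry["frac_bits"])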
47

Xing, Siyuan, Qingyu Han, and Efstathios G. Charalampidis. "CombOpNet: a Neural-Network Accelerator for SINDy." Journal of Vibration Testing and System Dynamics 9, no. 1 (2025): 1–20. https://doi.org/10.5890/jvtsd.2025.03.001.

48

Seto, Kenshu. "A Survey on System-Level Design of Neural Network Accelerators." Journal of Integrated Circuits and Systems 16, no. 2 (2021): 1–10. http://dx.doi.org/10.29292/jics.v16i2.505.

Abstract
In this paper, we present a brief survey on the system-level optimizations used for convolutional neural network (CNN) inference accelerators. For the nested loop of convolutional (CONV) layers, we discuss the effects of loop optimizations such as loop interchange, tiling, unrolling and fusion on CNN accelerators. We also explain memory optimizations that are effective with the loop optimizations. In addition, we discuss streaming architectures and single computation engine architectures that are commonly used in CNN accelerators. Optimizations for CNN models are briefly explained, followed by the recent trends and future directions of the CNN accelerator design.
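As a concrete example of the loop optimizations surveyed here, the sketch below tiles the output-channel and input-channel loops of a convolution layer so that only a small tile of weights and feature maps is live at a time. This is plain Python meant to show the loop structure after tiling, not a performance-oriented implementation, and the tile sizes are arbitrary.

import random

def conv_tiled(inp, weights, out, Tm=4, Tn=8):
    # inp: [N][H][W], weights: [M][N][K][K], out: [M][R][C]
    # Assumes stride 1 and no padding, so H = R + K - 1 and W = C + K - 1.
    M, R, C = len(out), len(out[0]), len(out[0][0])
    N, K = len(inp), len(weights[0][0])
    for m0 in range(0, M, Tm):                 # tile over output channels
        for n0 in range(0, N, Tn):             # tile over input channels
            # Within a tile, only Tm output and Tn input channels are live,
            # which bounds the required on-chip buffer size.
            for m in range(m0, min(m0 + Tm, M)):
                for n in range(n0, min(n0 + Tn, N)):
                    for r in range(R):
                        for c in range(C):
                            for i in range(K):
                                for j in range(K):
                                    out[m][r][c] += weights[m][n][i][j] * inp[n][r + i][c + j]

N, M, K, R, C = 3, 2, 3, 4, 4
inp = [[[random.random() for _ in range(C + K - 1)] for _ in range(R + K - 1)] for _ in range(N)]
weights = [[[[random.random() for _ in range(K)] for _ in range(K)] for _ in range(N)] for _ in range(M)]
out = [[[0.0 for _ in range(C)] for _ in range(R)] for _ in range(M)]
conv_tiled(inp, weights, out, Tm=1, Tn=2)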
49

Paulenka, D. A. "Comparative analysis of single-board computers for the development of a microarchitectural computing system for fire detection." Informatics 21, no. 2 (2024): 73–85. http://dx.doi.org/10.37661/1816-0301-2024-21-2-73-85.

Abstract
Objectives. The purpose of the work is to select the basic computing microplatform of the onboard microarchitectural computing complex for the detection of anomalous situations on the territory of the Republic of Belarus from space on the basis of artificial intelligence methods.

Methods. The method of comparative analysis is used to select a computing platform. A series of performance tests and comparative analysis (benchmarking) are performed on the selected equipment. The comparative and benchmarking analyses are performed in accordance with the terms of reference of the current project.

Results. A comparative analysis and performance testing of the Raspberry Pi 4 Model B and Cool Pi 4 Model B single-board computers, as well as the Google Coral USB Accelerator AI accelerator with Google Edge TPU, have been performed. The comparative analysis showed that the Raspberry Pi 4 Model B and Cool Pi 4 Model B fully meet the terms of reference of the current project. At the same time, the Cool Pi 4 Model B handles neural network calculations well, but four times slower than similar calculations on the Google Coral USB Accelerator. Neural network computations on the Raspberry Pi 4 Model B are 22 times slower than similar computations on the Google Coral USB Accelerator. The Cool Pi 4 Model B outperforms the Raspberry Pi 4 Model B by a factor of two to three for data copying and compression and is almost six times faster for neural network computations.

Conclusion. Although the Raspberry Pi 4 Model B meets the terms of reference of the project as a computational basis, when developing an on-board microarchitectural computing system for detecting anomalous situations it is worth using more powerful alternatives with built-in AI accelerators (e.g., Radxa Rock 5 Model A) or with an additional external AI accelerator (e.g., a combination of the Cool Pi 4 Model B and the Google Coral USB Accelerator). Using a Raspberry Pi 4 Model B with an additional AI accelerator is also acceptable and would speed up computations by several dozen times. AI accelerators provide the fastest neural network computations, but there are features related to the novelty of the technology that will be explored in further development.
50

Park, Sang-Soo, and Ki-Seok Chung. "CONNA: Configurable Matrix Multiplication Engine for Neural Network Acceleration." Electronics 11, no. 15 (2022): 2373. http://dx.doi.org/10.3390/electronics11152373.

Abstract
Convolutional neural networks (CNNs) have demonstrated promising results in various applications such as computer vision, speech recognition, and natural language processing. One of the key computations in many CNN applications is matrix multiplication, which accounts for a significant portion of computation. Therefore, hardware accelerators to effectively speed up the computation of matrix multiplication have been proposed, and several studies have attempted to design hardware accelerators that perform matrix multiplications better in terms of both speed and power consumption. Typically, accelerators with either a two-dimensional (2D) systolic array structure or a single instruction multiple data (SIMD) architecture are effective only when the input matrix has a shape that is close to or similar to a square. However, several CNN applications require multiplications of non-square matrices with various shapes and dimensions, and such irregular shapes lead to poor utilization efficiency of the processing elements (PEs). This study proposes a configurable engine for neural network acceleration, called CONNA, whose computation engine can conduct matrix multiplications with highly utilized computing units, regardless of the access patterns, shapes, and dimensions of the input matrices, by changing the shape of the matrix multiplication conducted in the physical array. To verify the functionality of the CONNA accelerator, we implemented an SoC platform that integrates CONNA with a RISC-V MCU on a Xilinx VC707 FPGA. SqueezeNet on CONNA achieved an inference performance of 100 frames per second (FPS) with 2.36 mm2 and 83.55 mW in a 65 nm process, improving efficiency by up to 34.1 times compared to existing accelerators in terms of FPS, silicon area, and power consumption.
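The utilization problem CONNA addresses can be quantified with a small helper: map an output matrix onto a fixed PE array and count the fraction of PE-cycles doing useful work. The array sizes and the fully connected example below are illustrative assumptions, not CONNA's actual configuration.

import math

def pe_utilization(m, n, array_rows, array_cols):
    # Fraction of PE-cycles doing useful work when an (m x n) output grid is
    # mapped onto an array_rows x array_cols PE array, tile by tile.
    tiles = math.ceil(m / array_rows) * math.ceil(n / array_cols)
    useful = m * n
    return useful / (tiles * array_rows * array_cols)

# A tall-skinny output (e.g., a 1000 x 1 fully connected result) on 256 PEs.
print("square 16x16 mapping:", pe_utilization(1000, 1, 16, 16))    # ~6% of PEs busy
print("reshaped 256x1 mapping:", pe_utilization(1000, 1, 256, 1))  # ~98% busy

Reshaping the logical mapping to match the skinny matrix, which is exactly what a configurable engine allows, recovers most of the lost utilization.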