Journal articles on the topic 'SIMD architecture'

Author: Grafiati

Published: 4 June 2021

Last updated: 27 April 2022

Consult the top 50 journal articles for your research on the topic 'SIMD architecture.'

1

Shen, Zheng, Hu He, Yanjun Zhang, and Yihe Sun. "A Video Specific Instruction Set Architecture for ASIP design." VLSI Design 2007 (November 15, 2007): 1–7. http://dx.doi.org/10.1155/2007/58431.

Abstract:

This paper describes a novel video specific instruction set architecture for ASIP design. With single instruction multiple data (SIMD) instructions, two destination modes, and video specific instructions, an instruction set architecture is introduced to enhance the performance for video applications. Furthermore, we quantify the improvement on H.263 encoding. In this paper, we evaluate and compare the performance of VS-ISA, other DSPs (digital signal processors), and conventional SIMD media extensions in the context of video coding. Our evaluation results show that VS-ISA improves the processor's performance by approximately 5x on H.263 encoding, and VS-ISA outperforms other architectures by 1.6x to 8.57x in computing IDCT.

2

Wang, Guang, and Yin Sheng Gao. "An Implementation of Configurable SIMD Core on FPGA." Applied Mechanics and Materials 336-338 (July 2013): 1925–29. http://dx.doi.org/10.4028/www.scientific.net/amm.336-338.1925.

Abstract:

In order to meet the computing speed required by 4G wireless communications, and to provide the different data processing widths required by different algorithms, an SIMD (Single Instruction Multiple Data) core has been designed. The ISA (Instruction Set Architecture) and main components of the SIMD core are discussed, with a focus on how the SIMD core can be configured. Finally, the simulation result of the multiplication of two 8×8 matrices is presented to show the execution of instructions in the proposed SIMD core, and the result verifies the correctness of the SIMD core design.

3

Liu, Song Ping. "Optimization Method of Coherent Accumulation Operation Based on SIMD Architecture." Applied Mechanics and Materials 644-650 (September 2014): 4330–33. http://dx.doi.org/10.4028/www.scientific.net/amm.644-650.4330.

Abstract:

As SIMD functional units are widely used for application acceleration, how to use this architecture effectively to optimize applications has become a hot topic in compiler optimization. This paper discusses SIMD instructions and pipeline optimization methods. Applying MMX instructions and pipeline optimization provides a strong guarantee of real-time performance for a pure software receiver.

4

FUJITA, YOSHIHIRO, NOBUYUKI YAMASHITA, and SHIN-ICHIRO OKAZAKI. "IMAP: INTEGRATED MEMORY ARRAY PROCESSOR." Journal of Circuits, Systems and Computers 02, no. 03 (1992): 227–45. http://dx.doi.org/10.1142/s0218126692000155.

Abstract:

This paper presents architectural features and performances for an Integrated Memory Array Processor (IMAP) LSI, which integrates a large capacity memory and a one-dimensional SIMD processor array on a single chip. The IMAP has a conventional memory interface, almost the same as a dual port video RAM with operational input extension. SIMD processing is carried out on the IMAP chip, using an internal processor array, while other higher level processing is concurrently accomplished with external processors through the random access memory port. In addition to the basic IMAP architecture, this paper describes orthogonal IMAP, which has an extended IMAP architecture. The basic IMAP uses a conventional memory cell, while the orthogonal IMAP uses an orthogonal memory for holding images.

5

Suaib, Mohammad, Abel Palaty, and Kumar Sambhav Pandey. "Architecture of SIMD Type Vector Processor." International Journal of Computer Applications 20, no. 4 (2011): 42–45. http://dx.doi.org/10.5120/2418-3233.

6

Jiang, Li, Tianjian Li, Naifeng Jing, Nam Sung Kim, Minyi Guo, and Xiaoyao Liang. "CNFET-Based High Throughput SIMD Architecture." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, no. 7 (2018): 1331–44. http://dx.doi.org/10.1109/tcad.2017.2695899.

7

Bruno, Alessandro, Fabrizio Pisacane, and Vittorio Rosato. "Simulation of a Two-Dimensional Dipolar System on a APE100/Quadrics Simd Architecture." International Journal of Modern Physics C 08, no. 03 (1997): 459–72. http://dx.doi.org/10.1142/s0129183197000382.

Abstract:

The temperature behavior of a system of dipoles with long-range interactions has been simulated via a two-dimensional lattice Monte Carlo on a massively SIMD platform (Quadrics/APE100). Thermodynamic quantities have been evaluated in order to locate and to characterize the phase transition in the absence of an applied field. Emphasis is given to the code implementation on the SIMD architecture and to the relevant features which have been used to improve code capabilities and performance.

8

ZIPPEL, RICHARD. "THE DATA STRUCTURE ACCELERATOR ARCHITECTURE." International Journal of High Speed Electronics and Systems 07, no. 04 (1996): 533–71. http://dx.doi.org/10.1142/s012915649600030x.

Abstract:

We present a heterogeneous architecture that contains a fine grained, massively parallel SIMD component called the data structure accelerator and demonstrate its use in a number of problems in computational geometry including polygon filling and convex hull. The data structure accelerator is extremely dense and highly scalable. Systems of 10^6 processing elements can be embedded in workstations and personal computers, without dramatically changing their cost. These components are intended for use in tandem with conventional single sequence machines and with small scale, shared memory multiprocessors. A language for programming these heterogeneous systems is presented that smoothly incorporates the SIMD instructions of the data structure accelerator with conventional single sequence code. We then demonstrate how to construct a number of higher level primitives such as maximum and minimum, and apply these tools to problems in logic and computational geometry. For computational geometry problems, we demonstrate that simple algorithms that take advantage of the parallelism available on a data structure accelerator perform as well as or better than the far more complex algorithms which are needed for comparable efficiency on single sequence computers.

9

BHANDARKAR, SUCHENDRA M., HAMID R. ARABNIA, and JEFFREY W. SMITH. "A RECONFIGURABLE ARCHITECTURE FOR IMAGE PROCESSING AND COMPUTER VISION." International Journal of Pattern Recognition and Artificial Intelligence 09, no. 02 (1995): 201–29. http://dx.doi.org/10.1142/s0218001495000110.

Abstract:

In this paper we describe a reconfigurable architecture for image processing and computer vision based on a multi-ring network which we call a Reconfigurable Multi-Ring System (RMRS). We describe the reconfiguration switch for the RMRS and also describe its VLSI implementation. The RMRS topology is shown to be regular and scalable and hence well-suited for VLSI implementation. We prove some important properties of the RMRS topology and show that a broad class of algorithms for the n-cube can be mapped to the RMRS in a simple and elegant manner. We design and analyze a class of procedural primitives for the SIMD RMRS and show how these primitives can be used as building blocks for more complex parallel operations. We demonstrate the usefulness of the RMRS for problems in image processing and computer vision by considering two important operations—the Fast Fourier Transform (FFT) and the Hough transform for detection of linear features in an image. Parallel algorithms for the FFT and the Hough transform on the SIMD RMRS are designed using the aforementioned procedural primitives. The analysis of the complexity of these algorithms shows that the SIMD RMRS is a viable architecture for problems in computer vision and image processing.

10

Moudgill, Mayan, Andrei Iancu, and Daniel Iancu. "Galois Field Instructions in the Sandblaster 2.0 Architectrue." International Journal of Digital Multimedia Broadcasting 2009 (2009): 1–5. http://dx.doi.org/10.1155/2009/129698.

Abstract:

This paper presents a novel approach to implementing multiplication in Galois Fields with 2^N elements. Elements of GF(2^N) can be represented as polynomials of degree less than N over GF(2). Operations are performed modulo an irreducible polynomial of degree N over GF(2). Our approach splits a Galois Field multiply into two operations, polynomial-multiply and polynomial-remainder over GF(2). We show how these two operations can be implemented using the same hardware. Further, we show that in many cases several polynomial-multiply operations can be combined before needing to perform a polynomial-remainder. The Sandblaster 2.0 is a SIMD architecture. It has SIMD variants of the poly-multiply and poly-remainder instructions. We use a Reed-Solomon encoder and decoder to demonstrate the performance of our approach. Our new approach achieves a speedup of 11.5x compared to the standard SIMD processor of 8x.
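
The split described in this abstract can be illustrated with a minimal software sketch. It is not the Sandblaster 2.0 instruction set or the authors' code: the function names are hypothetical, field elements are held as Python integers whose bit i is the coefficient of x^i, and the modulus 0x11B (the AES polynomial for GF(2^8)) is chosen only as a familiar example. The last lines show one plausible reading of "combining" multiplies: because the remainder is linear over GF(2), several products can be XORed together and reduced once.

```python
def poly_multiply(a: int, b: int) -> int:
    """Carry-less multiply of two polynomials over GF(2).

    Bit i of each integer is the coefficient of x^i; addition is XOR.
    """
    result = 0
    while b:
        if b & 1:          # this term of b is present: add a shifted copy of a
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_remainder(value: int, modulus: int) -> int:
    """Remainder of `value` divided by `modulus`, both polynomials over GF(2)."""
    mod_degree = modulus.bit_length() - 1
    while value.bit_length() - 1 >= mod_degree:
        shift = value.bit_length() - 1 - mod_degree
        value ^= modulus << shift   # cancel the current leading term
    return value


def gf_multiply(a: int, b: int, modulus: int) -> int:
    """GF(2^N) multiply expressed as polynomial-multiply then polynomial-remainder."""
    return poly_remainder(poly_multiply(a, b), modulus)


# Assumption for illustration only: GF(2^8) with x^8 + x^4 + x^3 + x + 1.
EXAMPLE_POLY = 0x11B

product = gf_multiply(0x57, 0x83, EXAMPLE_POLY)

# Two unreduced products are XORed first and reduced with a single remainder.
combined = poly_remainder(
    poly_multiply(0x57, 0x83) ^ poly_multiply(0x57, 0x13), EXAMPLE_POLY
)
```

Deferring the remainder in this way gives the same result as reducing each product separately and then XORing, which is why a sum of products (as in a Reed-Solomon syndrome computation) needs only one reduction in this sketch.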

11

Liu, Dake, Joar Sohl, and Jian Wang. "Parallel Programming and Its Architectures Based on Data Access Separated Algorithm Kernels." International Journal of Embedded and Real-Time Communication Systems 1, no. 1 (2010): 64–85. http://dx.doi.org/10.4018/jertcs.2010103004.

Abstract:

A novel master-multi-SIMD architecture and its kernel (template) based parallel programming flow is introduced as a parallel signal processing platform. The name of the platform is ePUMA (embedded Parallel DSP processor architecture with Unique Memory Access). The essential technology is to separate data accessing kernels from arithmetic computing kernels so that the run-time cost of data access can be minimized by running it in parallel with algorithm computing. The SIMD memory subsystem architecture based on the proposed flow dramatically improves the total computing performance. The hardware system and programming flow introduced in this article will primarily aim at low-power high-performance embedded parallel computing with low silicon cost for communications and similar real-time signal processing.

12

Kim, Yongjoo, Jongeun Lee, Jinyong Lee, and Yunheung Paek. "Scalable Application Mapping for SIMD Reconfigurable Architecture." JSTS:Journal of Semiconductor Technology and Science 15, no. 6 (2015): 634–46. http://dx.doi.org/10.5573/jsts.2015.15.6.634.

13

Auguin, M., F. Boeri, J. P. Dalban, and A. Vincent-Carrefour. "Experience using a SIMD/SPMD multiprocessor architecture." Microprocessing and Microprogramming 21, no. 1-5 (1987): 171–77. http://dx.doi.org/10.1016/0165-6074(87)90034-2.

14

Patwardhan, Jaidev, Chris Dwyer, and Alvin R. Lebeck. "A self-organizing defect tolerant SIMD architecture." ACM Journal on Emerging Technologies in Computing Systems 3, no. 2 (2007): 10. http://dx.doi.org/10.1145/1265949.1265956.

15

Wójcik, Zbigniew. "Parallel shape coding on a SIMD architecture." Engineering Applications of Artificial Intelligence 3, no. 1 (1990): 11–18. http://dx.doi.org/10.1016/0952-1976(90)90017-g.

16

Nudd, Graham, Nick Francis, Tim Atherton, Darren Kerbyson, Roger Packwood, and John Vaudin. "Hierarchical multiple-SIMD architecture for image analysis." Machine Vision and Applications 5, no. 2 (1992): 85–103. http://dx.doi.org/10.1007/bf02620309.

17

Bariani, M., P. Lambruschini, and M. Raggio. "An Efficient Multi-Core SIMD Implementation for H.264/AVC Encoder." VLSI Design 2012 (May 29, 2012): 1–14. http://dx.doi.org/10.1155/2012/413747.

Abstract:

The optimization process of an H.264/AVC encoder on three different architectures is presented. The architectures are multi- and single-core, and their SIMD instruction sets have different vector register sizes. Code optimization is fundamental when addressing HD resolutions with real-time constraints. The encoder is subdivided into functional modules in order to better understand where the optimization is a key factor and to evaluate the performance improvement in detail. Common issues in both partitioning a video encoder into parallel architectures and SIMD optimization are described, and the authors' solutions are presented for all the architectures. Besides showing efficient video encoder implementations, one of the main purposes of this paper is to discuss how the characteristics of different architectures and different sets of SIMD instructions can impact the target application performance. Results about the achieved speedup are provided in order to compare the different implementations and evaluate the most suitable solutions for present and next generation video-coding algorithms.

18

Qi, Jin, Can Qun Yang, Cheng Chen, Qiang Wu, and Tao Tang. "Accelerating IDCT Algorithm on Xeon Phi Coprocessor." Advanced Materials Research 756-759 (September 2013): 3114–20. http://dx.doi.org/10.4028/www.scientific.net/amr.756-759.3114.

Abstract:

Inverse Discrete Cosine Transform (IDCT) is an important operation for image and video decompression. How to accelerate the IDCT algorithm has been frequently studied. Recently Intel has proposed Xeon Phi coprocessors based on the many integrated core (MIC) architecture. Xeon Phi is integrated with 61 cores and a 512-bit SIMD extension within each core, thus providing very high performance. In this paper, we employ the Knights Corner (a beta version of Xeon Phi) to accelerate the IDCT algorithm. By employing the 512-bit SIMD instructions and data pre-fetching optimization, our implementation achieves (1) an average 5.82x speedup over the non-SIMD version, (2) an average 27.3% performance benefit from the data pre-fetching optimization, and (3) an average 1.53x speedup on one Knights Corner coprocessor over the implementation on one octal-core Intel Xeon E5-2670 CPU.

19

Kunzman, David M., and Laxmikant V. Kalé. "Programming Heterogeneous Clusters with Accelerators Using Object-Based Programming." Scientific Programming 19, no. 1 (2011): 47–62. http://dx.doi.org/10.1155/2011/525717.

Abstract:

Heterogeneous clusters that include accelerators have become more common in the realm of high performance computing because of the high GFlop/s rates such clusters are capable of achieving. However, heterogeneous clusters are typically considered hard to program as they usually require programmers to interleave architecture-specific code within application code. We have extended the Charm++ programming model and runtime system to support heterogeneous clusters (with host cores that differ in their architecture) that include accelerators. We are currently focusing on clusters that include commodity processors, Cell processors, and Larrabee devices. When our extensions are used to develop code, the resulting code is portable between various homogeneous and heterogeneous clusters that may or may not include accelerators. Using a simple example molecular dynamics (MD) code, we demonstrate our programming model extensions and runtime system modifications on a heterogeneous cluster comprised of Xeon and Cell processors. Even though there is no architecture-specific code in the example MD program, it is able to successfully make use of three core types, each with a different ISA (Xeon, PPE, SPE), three SIMD instruction extensions (SSE, AltiVec/VMX and the SPE's SIMD instructions), and two memory models (cache hierarchies and scratchpad memories) in a single execution. Our programming model extensions abstract away hardware complexities while our runtime system modifications automatically adjust application data to account for architectural differences between the various cores.

20

KUTIL, RADE, and PETER EDER. "PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS." Parallel Processing Letters 16, no. 03 (2006): 335–49. http://dx.doi.org/10.1142/s012962640600268x.

Abstract:

Much work has been done to optimize wavelet transforms for SIMD extensions of modern CPUs. However, these approaches are mostly restricted to the vertical part of 2-D transforms with line-wise organized memory layouts because this leads to a rather straightforward SIMD implementation. Using a common wavelet filter as an example, this work shows new approaches to using SIMD operations on 1-D transforms that are able to produce reasonable speedups. As a result, the performance of algorithms that use wavelet transforms, such as JPEG2000, can be increased significantly. Several variants of parallelization are presented and compared. Their advantages and disadvantages for general filters are discussed.

21

DATTA, ABHIJIT, SHIRISH V. JOSHI, and RABI N. MAHAPATRA. "MODELLING A MORPHOLOGICAL THINNING ALGORITHM FOR SHARED MEMORY SIMD COMPUTERS." Parallel Processing Letters 01, no. 01 (1991): 59–65. http://dx.doi.org/10.1142/s0129626491000227.

Abstract:

This letter presents the modelling of a morphological thinning algorithm suggested by Jang and Chin [1] on the four models of shared memory SIMD computers. The time and cost complexity analyses for the models have been given. The performance of this algorithm on SIMD computers has been compared with the performance of a conventional thinning algorithm [2] proposed recently.

22

Zhang, Weihua, Xinglong Qian, Ye Wang, Binyu Zang, and Chuanqi Zhu. "Optimizing compiler for shared-memory multiple SIMD architecture." ACM SIGPLAN Notices 41, no. 7 (2006): 199–208. http://dx.doi.org/10.1145/1159974.1134679.

23

Lo, Wing-Yee, Daniel Pak-Kong Lun, Wan-Chi Siu, Wendong Wang, and Jiqiang Song. "Improved SIMD Architecture for High Performance Video Processors." IEEE Transactions on Circuits and Systems for Video Technology 21, no. 12 (2011): 1769–83. http://dx.doi.org/10.1109/tcsvt.2011.2130250.

24

Patwardhan, Jaidev P., Vijeta Johri, Chris Dwyer, and Alvin R. Lebeck. "A defect tolerant self-organizing nanoscale SIMD architecture." ACM SIGOPS Operating Systems Review 40, no. 5 (2006): 241–51. http://dx.doi.org/10.1145/1168917.1168888.

25

Patwardhan, Jaidev P., Vijeta Johri, Chris Dwyer, and Alvin R. Lebeck. "A defect tolerant self-organizing nanoscale SIMD architecture." ACM SIGPLAN Notices 41, no. 11 (2006): 241–51. http://dx.doi.org/10.1145/1168918.1168888.

26

Patwardhan, Jaidev P., Vijeta Johri, Chris Dwyer, and Alvin R. Lebeck. "A defect tolerant self-organizing nanoscale SIMD architecture." ACM SIGARCH Computer Architecture News 34, no. 5 (2006): 241–51. http://dx.doi.org/10.1145/1168919.1168888.

27

Wang, Guang, and Xiang Jun Li. "A Design of SIMD Core Based on PIM Technology." Advanced Materials Research 753-755 (August 2013): 2498–502. http://dx.doi.org/10.4028/www.scientific.net/amr.753-755.2498.

Abstract:

A detailed design of the micro-architecture of a SIMD core based on PIM technology is presented. The system architecture is implemented completely in the Verilog hardware description language and simulated with Xilinx ISE, and its functional correctness is verified from the simulation waveforms. The results show that, by making full use of PIM technology, the bandwidth of data access can be increased and its delay reduced, which in turn greatly improves the performance of the entire system.

28

Devireddy, Srinivasa Kumar, Iyyanki V. Murali Krishna, and Venkateswara Rao Tiruveedhula. "Real-time Face Recognition Using SIMD and VLIW Architecture." Journal of Computing and Information Technology 15, no. 2 (2007): 143. http://dx.doi.org/10.2498/cit.1000899.

29

Waeijen, Luc, Dongrui She, Henk Corporaal, and Yifan He. "A Low-Energy Wide SIMD Architecture with Explicit Datapath." Journal of Signal Processing Systems 80, no. 1 (2014): 65–86. http://dx.doi.org/10.1007/s11265-014-0950-8.

30

Wei, Haitao, Yu Junqing, and Li Jiang. "The design and evaluation of hierarchical multi-level parallelisms for H.264 encoder on multi-core architecture." Computer Science and Information Systems 7, no. 1 (2010): 189–200. http://dx.doi.org/10.2298/csis1001189w.

Abstract:

As a video coding standard, H.264 achieves a high compression rate while keeping good fidelity. But it requires more intensive computation than earlier standards to reach such high coding performance. A Hierarchical Multi-level Parallelisms (HMLP) framework for the H.264 encoder is proposed, which integrates four levels of parallelism (frame-level, slice-level, macroblock-level and data-level) into one implementation. Each level of parallelism is designed in a hierarchical parallel framework and mapped onto the cores and SIMD units of a multi-core architecture. Based on an analysis of the coding performance at each level, we propose a method to combine different parallel levels to attain a good compromise between high speedup and low bit-rate. The experimental results show that for CIF format video, our method achieves a speedup of 33.57x-42.3x with a 1.04x-1.08x bit-rate increase on an 8-core Intel Xeon processor with SIMD Technology.

31

Stepanenko, Sergey, and Vasiliy Yuzhakov. "Exascale supercomputers. Architectural outlines." Program systems: theory and applications 4, no. 4 (2013): 61–90. http://dx.doi.org/10.12737/2418.

Abstract:

Architectural aspects of exascale supercomputers are explored. Parameters of the computing environment and interconnect are evaluated. It is shown that reaching exascale performances requires hybrid systems. Processor elements of such systems comprise CPU cores and arithmetic accelerators, implementing the MIMD and SIMD computing disciplines, respectively. Efficient exascale hybrid systems require fundamentally new applications and architectural efficiency scaling solutions, including: 1) process-aware structural reconfiguring of hybrid processor elements by varying the number of MIMD cores and SIMD cores communicating with them to attain as high performance and efficiency as possible under given conditions; 2) application of conflict-free sets of sources and receivers and/or decomposition of the computation to subprocesses and their allocation to environment elements in accordance with their features and communication topology to minimize communication time; 3) application of topological redundancy methods to preserve the topology and overall performance achieved by the above communication time minimization solutions in case of element failure thus maintaining the efficiency reached by the above reconfiguring and communication minimization solutions, i.e. to provide fault-tolerant efficiency scaling. Application of these solutions is illustrated by running molecular dynamics tests and the NPB LU benchmark. The resulting architecture displays dynamic adaptability to program features, which in turn ensures the efficiency of using exascale supercomputers.

32

Chen, Cheng, Can Qun Yang, Wen Ke Yao, Jin Qi, and Qiang Wu. "Accelerating PQMRCGSTAB Algorithm on Xeon Phi." Advanced Materials Research 709 (June 2013): 555–62. http://dx.doi.org/10.4028/www.scientific.net/amr.709.555.

Abstract:

Using iterative methods to solve large sparse linear systems is key to many practical mathematical and physical problems. Recently, Intel released Xeon Phi, a many-core processor of Intel’s Many Integrated Core (MIC) architecture, which comprises 60 cores and supports 512-bit SIMD operations. In this work, we aim at accelerating an iterative algorithm for large sparse linear systems, named PQMRCGSTAB, by using both Xeon Phi’s 8-way vector operations and dense threads. We then propose three optimizations to improve the performance: data prefetching to hide the data latency, vector register reuse, and SIMD-friendly reduction. Our experimental evaluation on Xeon Phi delivers a speedup of close to a factor of 6 compared to the Intel Xeon E5-2670 octal-core CPU running the same problem.
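
The abstract does not spell out what its SIMD-friendly reduction looks like, so the following is only a generic sketch of the usual pattern, not the authors' implementation: keep one partial sum per vector lane while streaming through the data, then combine the lanes once at the end with a halving tree. The name `simd_style_sum` is hypothetical, and the 8 lanes are modelled with a plain Python list on the assumption that "8-way" refers to eight 64-bit values in a 512-bit register.

```python
LANES = 8  # assumption: eight 64-bit lanes in one 512-bit vector register


def simd_style_sum(values):
    """Sum `values` the way a vectorised loop would: per-lane partial sums first."""
    acc = [0.0] * LANES

    # Main loop: each iteration models one vector add of up to LANES elements.
    for start in range(0, len(values), LANES):
        chunk = values[start:start + LANES]
        for lane, x in enumerate(chunk):
            acc[lane] += x

    # Horizontal step: fold the upper half of the lanes onto the lower half.
    width = LANES
    while width > 1:
        width //= 2
        for lane in range(width):
            acc[lane] += acc[lane + width]
    return acc[0]


total = simd_style_sum([float(i) for i in range(1, 101)])
```

The point of the pattern is that the cross-lane (horizontal) work happens once, after the loop, instead of on every iteration. With floating-point data the summation order differs from a sequential loop, so results can differ in the last bits.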

33

Seo, Hwajeong, Hyunjun Kim, Kyungbae Jang, et al. "Secure HIGHT Implementation on ARM Processors." Mathematics 9, no. 9 (2021): 1044. http://dx.doi.org/10.3390/math9091044.

Abstract:

Secure and compact designs of the HIGHT block cipher on representative ARM microcontrollers are presented in this paper. We present several optimizations for implementations of the HIGHT block cipher, which exploit different parallel approaches, including task parallelism and data parallelism methods, for high-speed and high-throughput implementations. For the efficient parallel implementation of the HIGHT block cipher, the SIMD instructions of the ARM architecture are fully utilized. These instructions support four-way 8-bit operations in parallel. The primitive operations in the HIGHT block cipher are 8-bit-wise addition–rotation–exclusive-or operations. In the 32-bit word architecture (i.e., the 32-bit ARM architecture), four 8-bit operations are executed at once with the four-way SIMD instruction. By exploiting the SIMD instruction, three parallel HIGHT implementations are presented, including task-parallel, data-parallel, and task/data-parallel implementations. In terms of the secure implementation, we present a fault injection countermeasure for 32-bit ARM microcontrollers. The implementation ensures fault detection through intra-instruction redundancy in the data format. In particular, we propose two fault detection implementations based on the parallel implementations. The two-way task/data-parallel-based implementation is secure against fault injection models, including chosen bit pair, random bit, and random byte. The alternative four-way data-parallel-based implementation ensures all security features of the aforementioned secure implementations. Moreover, the instruction skip model is also prevented. The implementation of the HIGHT block cipher is further improved by using the constant value of the counter mode of operation. In particular, the 32-bit nonce value is pre-computed and the intermediate result is directly utilized. Finally, the optimized implementation achieves faster execution timing and better security against fault attacks than previous works.
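
To make the "four 8-bit operations in one 32-bit word" idea concrete, here is a small sketch that emulates lane-wise addition and rotation on four packed bytes using ordinary integer arithmetic. It does not reproduce the paper's ARM code or the HIGHT round function; on the real hardware a SIMD instruction would do this directly. All helper names (`pack`, `unpack`, `add8x4`, `rotl8x4`) are hypothetical. Lane-wise exclusive-or needs no helper because a plain XOR of two words never crosses lane boundaries.

```python
LOW7 = 0x7F7F7F7F    # low seven bits of every byte lane
HIGH1 = 0x80808080   # top bit of every byte lane


def pack(lanes):
    """Pack four bytes (lane 0 in the least significant byte) into one word."""
    return lanes[0] | (lanes[1] << 8) | (lanes[2] << 16) | (lanes[3] << 24)


def unpack(word):
    """Split a 32-bit word back into its four byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def add8x4(a, b):
    """Add four byte lanes modulo 256 without carries leaking between lanes."""
    low = (a & LOW7) + (b & LOW7)          # 7-bit sums cannot overflow a lane
    return low ^ ((a ^ b) & HIGH1)         # fold the top bits back in with XOR


def rotl8x4(a, r):
    """Rotate every byte lane left by r bits."""
    r %= 8
    if r == 0:
        return a
    left_mask = ((0xFF << r) & 0xFF) * 0x01010101    # bits kept after the left shift
    right_mask = (0xFF >> (8 - r)) * 0x01010101      # bits wrapped around to the bottom
    return ((a << r) & left_mask) | ((a >> (8 - r)) & right_mask)


x = [0x12, 0xFF, 0x80, 0x01]
y = [0x34, 0x01, 0x80, 0xFF]
summed = unpack(add8x4(pack(x), pack(y)))
rotated = unpack(rotl8x4(pack(x), 3))
xored = unpack(pack(x) ^ pack(y))
```

Each helper can be checked against a byte-by-byte scalar loop, which is a convenient way to validate a packed implementation before porting it to real SIMD instructions.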

34

PERRI, STEFANIA, MARIA ANTONIA IACHINO, and PASQUALE CORSONELLO. "SIMD MULTIPLIERS FOR ACCELERATING EMBEDDED PROCESSORS IN FPGAs." Journal of Circuits, Systems and Computers 15, no. 04 (2006): 537–50. http://dx.doi.org/10.1142/s0218126606003210.

Abstract:

This paper describes a new efficient 32×32 Single Instruction Multiple Data (SIMD) multiplier suitable for the multimedia extension of FPGA-based processors. The proposed circuit can adapt itself to 32-, 16-, and 8-bit operand widths, avoiding time- and power-consuming reconfiguration. When implemented in an XCV400 device, the multiplier described here reaches a running frequency of about 97 MHz with an energy dissipation of just 20 mW/MHz. Comparisons with previously proposed SIMD multipliers for FPGA-based designs demonstrate that the new circuit allows the best area-time-power trade-off to be obtained.
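
As a purely behavioural illustration of what "adapting to 32-, 16-, and 8-bit operand widths" means, the sketch below treats two 32-bit inputs as one, two, or four lanes and multiplies them lane by lane. It says nothing about how the circuit in the paper is built. The function name `simd_multiply` is hypothetical, and unsigned operands with full double-width products are assumptions, since the abstract does not state the signedness or result format.

```python
def simd_multiply(a, b, lane_bits):
    """Multiply two 32-bit words lane by lane.

    lane_bits selects the operand width: 32 (one lane), 16 (two) or 8 (four).
    Returns the unsigned double-width products, lowest lane first.
    """
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16 or 32")
    lanes = 32 // lane_bits
    mask = (1 << lane_bits) - 1
    products = []
    for i in range(lanes):
        x = (a >> (i * lane_bits)) & mask   # extract lane i of each operand
        y = (b >> (i * lane_bits)) & mask
        products.append(x * y)
    return products


four_lanes = simd_multiply(0x01020304, 0x02020202, 8)
two_lanes = simd_multiply(0x00030002, 0x00050004, 16)
one_lane = simd_multiply(0xFFFFFFFF, 0xFFFFFFFF, 32)
```

The same call signature serves all three widths, which mirrors the idea of one datapath whose mode is selected at run time rather than by reconfiguring the FPGA.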

35

Banu, J. Saira, and M. Rajasekhara Babu. "Exploring Vectorization and Prefetching Techniques on Scientific Kernels and Inferring the Cache Performance Metrics." International Journal of Grid and High Performance Computing 7, no. 2 (2015): 18–36. http://dx.doi.org/10.4018/ijghpc.2015040102.

Abstract:

Performance improvement in modern processors is stagnating due to the power wall and memory wall problems. In general, the power wall problem is addressed by various vectorization design techniques, and the memory wall problem is mitigated through prefetching. In this paper, vectorization is achieved through the Single Instruction Multiple Data (SIMD) registers of the current processor. It provides architecture optimization by reducing the number of instructions in the pipeline and by minimizing the utilization of the multi-level memory hierarchy. These registers provide an economical computing platform compared to a Graphics Processing Unit (GPU) for compute-intensive applications. This paper explores software prefetching via Streaming SIMD Extensions (SSE) instructions to mitigate the memory wall problem. This work quantifies the effect of vectorization and prefetching in the Matrix Vector Multiplication (MVM) kernel with dense and sparse structure. Both prefetching and vectorization reduce the data and instruction cache pressure, thereby improving cache performance. To show the cache performance improvements in the kernel, the Intel VTune Amplifier is used. Finally, experimental results demonstrate promising performance of the matrix kernel on an Intel Haswell processor. However, effective utilization of SIMD registers remains a programming challenge for developers.

36

Kwon, Ki-Pyo, and Jae-Heung Lee. "A Speed-up Method of HOG Pedestrian Detector in Advanced SIMD Architecture." Journal of IKEEE 18, no. 1 (2014): 106–13. http://dx.doi.org/10.7471/ikeee.2014.18.1.106.

37

PAOLUCCI, P. S. "N-BODY CLASSICAL SYSTEMS AND NEURAL NETWORKS ON A 3D SIMD MASSIVE PARALLEL PROCESSOR: APE100/QUADRICS." International Journal of Modern Physics C 06, no. 02 (1995): 169–82. http://dx.doi.org/10.1142/s0129183195000137.

Abstract:

A number of physical systems (e.g., N body Newtonian, Coulombian or Lennard-Jones systems) can be described by N^2 interaction terms. Completely connected neural networks are characterised by the same kind of connections: Each neuron sends signals to all the other neurons via synapses. The APE100/Quadrics massive parallel architecture, with processing power in excess of 100 Gigaflops and a central memory of 8 Gigabytes seems to have processing power and memory adequate to simulate systems formed by more than 1 billion synapses or interaction terms. On the other hand the processing nodes of APE100/Quadrics are organised in a tridimensional cubic lattice; each processing node has a direct communication path only toward the first neighboring nodes. Here we describe a convenient way to map systems with global connectivity onto the first-neighbors connectivity of the APE100/Quadrics architecture. Some numeric criteria, which are useful for matching SIMD tridimensional architectures with globally connected simulations, are introduced.
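
The abstract does not give the mapping itself. One standard way to compute all-pairs terms using only nearest-neighbour links is a ring rotation, sketched below in one dimension. This is a substituted, simplified technique for illustration, not the three-dimensional scheme of the paper: each node keeps its own value fixed while a second copy circulates one step per iteration, and every node applies the same operation at every step, as in SIMD execution. The names `ring_all_pairs` and `interact` are hypothetical.

```python
def ring_all_pairs(values, interact):
    """Accumulate interact(own, other) over all other nodes using ring shifts only."""
    n = len(values)
    totals = [0.0] * n
    travelling = list(values)          # the copy that circulates around the ring

    for _ in range(n - 1):
        # Nearest-neighbour communication: node i receives from node i - 1.
        travelling = [travelling[(i - 1) % n] for i in range(n)]
        # Identical operation on every node, each with its own local data.
        for i in range(n):
            totals[i] += interact(values[i], travelling[i])
    return totals


# Toy interaction: product of the two values.
pair_sums = ring_all_pairs([1.0, 2.0, 4.0, 8.0], lambda own, other: own * other)
```

After n - 1 shifts every node has met every other node's value exactly once, so N nodes cover the N(N - 1) ordered pairs in N - 1 communication steps.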


38

Ortiz, Ariel. "Teaching the SIMD execution model:." ACM SIGCSE Bulletin 35, no. 1 (2003): 74–78. http://dx.doi.org/10.1145/792548.611936.


39

Morad, Amir, Leonid Yavits, and Ran Ginosar. "GP-SIMD Processing-in-Memory." ACM Transactions on Architecture and Code Optimization 11, no. 4 (2015): 1–26. http://dx.doi.org/10.1145/2686875.


40

Wang, Guang, and Yin Sheng Gao. "A Control Path Design of Communications Processor." Advanced Materials Research 694-697 (May 2013): 1459–64. http://dx.doi.org/10.4028/www.scientific.net/amr.694-697.1459.


Abstract:

With the widespread popularity of wireless mobile devices, emerging applications such as video telephony and HD video playback place new demands on them. The next generation of wireless mobile computing devices needs higher data transfer rates, more complex algorithms, and low power consumption. This paper presents a design method for the controller and data transfer module of a variable-width SIMD processor architecture. The controller design focuses on controlling the program flow and initializing the data transfer module. The fundamental modules have been designed and tested by simulating the architecture in the Xilinx ISE Design Suite. The simulation results verify the correctness of the controller and data transfer module design.


41

Danysh, A., and D. Tan. "Architecture and implementation of a vector/SIMD multiply-accumulate unit." IEEE Transactions on Computers 54, no. 3 (2005): 284–93. http://dx.doi.org/10.1109/tc.2005.41.


42

Chhugani, Jatin, Anthony D. Nguyen, Victor W. Lee, et al. "Efficient implementation of sorting on multi-core SIMD CPU architecture." Proceedings of the VLDB Endowment 1, no. 2 (2008): 1313–24. http://dx.doi.org/10.14778/1454159.1454171.


43

Ouerhani, Nabil, and Heinz Hügli. "Real-time visual attention on a massively parallel SIMD architecture." Real-Time Imaging 9, no. 3 (2003): 189–96. http://dx.doi.org/10.1016/s1077-2014(03)00036-6.


44

Miller, T., S. Alexander, and L. Faber. "An SIMD Multiprocessor Ring Architecture for the LMS Adaptive Algorithm." IEEE Transactions on Communications 34, no. 1 (1986): 89–92. http://dx.doi.org/10.1109/tcom.1986.1096423.


45

R, Maheswari, Pattabiraman V, and Sharmila P. "RECONFIGURABLE FPGA BASED SOFT-CORE PROCESSOR FOR SIMD APPLICATIONS." Asian Journal of Pharmaceutical and Clinical Research 10, no. 13 (2017): 180. http://dx.doi.org/10.22159/ajpcr.2017.v10s1.19632.


Abstract:

Objective: The prospective need for SIMD (Single Instruction, Multiple Data) applications such as video and image processing in a single system requires greater flexibility in computation to deliver high-quality real-time data. This paper analyzes an FPGA (Field Programmable Gate Array) based high-performance Reconfigurable OpenRISC1200 (ROR) soft-core processor for SIMD. Methods: The ROR1200 improves performance through data-level parallelism, executing SIMD instructions simultaneously in HPRC (High Performance Reconfigurable Computing) at reduced resource utilization through an RRF (Reconfigurable Register File) with multiple core functionalities. This work analyzes the functionality of the reconfigurable architecture by illustrating the implementation of two different image processing operations: image convolution and image quality improvement. The MAC (Multiply-Accumulate) unit of the ROR1200 is used to perform image convolution, and the execution unit with HPRC is used for image quality improvement. Result: With parallel execution on multiple cores, the proposed processor improves image quality by doubling the frame rate up to 60 fps (frames per second) with a peak power consumption of 400 mW. Thus the processor gives a significant computational cost of 12 ms with a refresh rate of 60 Hz and a MAC critical path delay of 1.29 ns. Conclusion: This FPGA-based processor is a feasible solution for portable embedded SIMD-based applications that need high performance at reduced power consumption.


46

FUJIMOTO, NORIYUKI. "DENSE MATRIX-VECTOR MULTIPLICATION ON THE CUDA ARCHITECTURE." Parallel Processing Letters 18, no. 04 (2008): 511–30. http://dx.doi.org/10.1142/s0129626408003545.


Abstract:

Recently GPUs have acquired the ability to perform fast general purpose computation by running thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experiments are conducted on a PC with GeForce 8800GTX and 2.0 GHz Intel Xeon E5335 CPU. The results show that the proposed algorithm runs a maximum of 11.19 times faster than NVIDIA's BLAS library CUBLAS 1.1 on the GPU and 35.15 times faster than the Intel Math Kernel Library 9.1 on a single core x86 with SSE3 SIMD instructions. The performance of Jacobi's iterative method for solving linear equations, which includes the data transfer time between CPU and GPU, shows that the proposed algorithm is practical for real applications.
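To show why a fast dense matrix-vector product matters for the Jacobi benchmark mentioned in this abstract, here is a minimal CPU-only sketch of Jacobi's iteration written around one matrix-vector product per sweep. It is not the paper's CUDA algorithm. The function names `matvec` and `jacobi`, the fixed iteration count, and the zero starting vector are assumptions chosen for illustration.

```python
def matvec(a, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def jacobi(a, b, iterations=50):
    """Approximate the solution of A x = b with Jacobi's method.

    Each sweep applies x <- x + D^{-1} (b - A x), where D is the diagonal
    of A. Convergence is guaranteed when A is strictly diagonally dominant.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        ax = matvec(a, x)  # the dense matrix-vector product dominates the cost
        x = [x[i] + (b[i] - ax[i]) / a[i][i] for i in range(n)]
    return x


# Strictly diagonally dominant system with exact solution x = (1, 2).
a = [[4.0, 1.0],
     [1.0, 3.0]]
b = [6.0, 7.0]
solution = jacobi(a, b)
```

Because every sweep calls `matvec` once, the run time of the whole solver tracks the speed of that kernel, which is the part the paper moves to the GPU.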


47

Matvienko, Sergey, Nikolay Alemasov, and Eduard Fomin. "Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture." Journal of Bioinformatics and Computational Biology 13, no. 01 (2015): 1540004. http://dx.doi.org/10.1142/s0219720015400041.


Abstract:

Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.


48

Fernandez Declara, Placido, and J. Daniel Garcia. "Compass SPMD: a SPMD vectorized tracking algorithm." EPJ Web of Conferences 245 (2020): 01006. http://dx.doi.org/10.1051/epjconf/202024501006.


Abstract:

Compass is an SPMD (Single Program Multiple Data) tracking algorithm for the upcoming LHCb upgrade in 2021. A data rate of 40 Tb/s needs to be processed in real time to select events. Alternative frameworks, algorithms and architectures are being tested to cope with the deluge of data. Allen is a research and development project aiming to run the full HLT1 (High Level Trigger) on GPUs (Graphics Processing Units). Allen’s architecture focuses on data-oriented layout and algorithms to better exploit parallel architectures. GPUs have already proved to exploit the framework efficiently with the algorithms developed for Allen, implemented and optimized for GPU architectures. We explore opportunities for the SIMD (Single Instruction Multiple Data) paradigm in CPUs through the Compass algorithm. We use the Intel SPMD Program Compiler (ISPC) to achieve good readability, maintainability and performance by writing “GPU-like” source code, preserving the main design of the algorithm.


49

FEIL, MANFRED, ANDREAS UHL, and MARIAN VAJTERŠIC. "COMPUTATION OF THE CONTINUOUS WAVELET TRANSFORM ON MASSIVELY PARALLEL SIMD ARRAYS." Parallel Processing Letters 09, no. 04 (1999): 453–66. http://dx.doi.org/10.1142/s0129626499000426.


Abstract:

Strategies for computing the continuous wavelet transform on massively parallel SIMD arrays are introduced and discussed. The different approaches are theoretically assessed and the results of implementations on a MasPar MP-2 are compared.


50

Hong, Ding-Yong, Yu-Ping Liu, Sheng-Yu Fu, Jan-Jan Wu, and Wei-Chung Hsu. "Improving SIMD Parallelism via Dynamic Binary Translation." ACM Transactions on Embedded Computing Systems 17, no. 3 (2018): 1–27. http://dx.doi.org/10.1145/3173456.
