
Journal articles on the topic 'Dataflow computation'


Consult the top 50 journal articles for your research on the topic 'Dataflow computation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Handa, Shivam, Konstantinos Kallas, Nikos Vasilakis, and Martin C. Rinard. "An order-aware dataflow model for parallel Unix pipelines." Proceedings of the ACM on Programming Languages 5, ICFP (2021): 1–28. http://dx.doi.org/10.1145/3473570.

Full text
Abstract:
We present a dataflow model for parallel Unix shell pipelines. To accurately capture the semantics of complex Unix pipelines, the dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. We use this model to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. We additionally formalize the translations from the Unix shell to the dataflow model and from the dataflow model back to a parallel shell script. We implement our model and transformations as the compiler and optimization passes of a system parallelizing shell pipelines, and use it to evaluate the speedup achieved on 47 pipelines.
APA, Harvard, Vancouver, ISO, and other styles
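The order-aware idea in this abstract lends itself to a small illustration. The sketch below is our own simplification, not the paper's formal model: a node lists its input edges in the order it drains them, the way `cat` reads its first operand to exhaustion before touching the second, and that declared order is what makes splitting the input and concatenating partial outputs a sound parallelization.

```python
# Order-aware consumption (illustrative sketch, not the paper's model):
# a node declares the order in which it drains its input edges.

def run_node(consume_order, input_edges, transform=lambda x: x):
    """Consume whole edges one at a time, in the declared order."""
    out = []
    for edge_idx in consume_order:
        for item in input_edges[edge_idx]:
            out.append(transform(item))
    return out

# Because edge 0 is fully consumed before edge 1 (as for `cat f1 f2`),
# the output is the in-order concatenation of the two streams.
print(run_node([0, 1], [["a\n", "b\n"], ["c\n"]]))  # ['a\n', 'b\n', 'c\n']
```

An interleaving consumer would need a different consumption order over the same edges, which is exactly the distinction the order-aware model tracks.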
2

MUSKULUS, MICHAEL, and ROBERT BRIJDER. "COMPLEXITY OF BIO-COMPUTATION: SYMBOLIC DYNAMICS IN MEMBRANE SYSTEMS." International Journal of Foundations of Computer Science 17, no. 01 (2006): 147–65. http://dx.doi.org/10.1142/s0129054106003747.

Full text
Abstract:
We discuss aspects of biological relevance to the modelling of bio-computation in a multiset rewriting system context: turnover, robustness against perturbations, and the dataflow programming paradigm. The systems under consideration are maximally parallel and asynchronous parallel membrane systems, the latter corresponding to computation in which the notion of time is operationally meaningless. A natural geometrical setting which seems promising for the study of computational processes in general multiset rewriting systems is presented. Configuration space corresponds to a subset of the lattice N^d, d ∈ N, and state transitions correspond to vector addition. The similarities and differences with Vector Addition Systems and Petri nets are discussed. Symbolic dynamics are introduced on special partitions of configuration space, and we indicate different notions of complexity for membrane systems based on this and related concepts such as graph complexity and minimal automata. Some examples of synchronized, pipelined dataflow computations are given, and decompositions into functional subunits are briefly commented on.
APA, Harvard, Vancouver, ISO, and other styles
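The geometric setting described in the abstract can be illustrated in a few lines. This toy sketch is our own, not the paper's: a configuration of a multiset rewriting system counts objects of d = 3 kinds (a point in N^d), and firing a rewrite rule adds a fixed integer displacement vector, as in a Vector Addition System; a rule is enabled only if no coordinate goes negative. The rule names and vectors are invented for the example.

```python
# Multiset rewriting as vector addition (toy illustration).

RULES = {
    "r1": (-1, +1, 0),   # consume one a, produce one b
    "r2": (0, -1, +2),   # consume one b, produce two c
}

def step(config, rule):
    """Apply one rewrite rule to a configuration (a point in N^3)."""
    new = tuple(x + d for x, d in zip(config, RULES[rule]))
    if any(x < 0 for x in new):
        raise ValueError("rule not enabled in this configuration")
    return new

c = (2, 0, 0)        # two a-objects
c = step(c, "r1")    # (1, 1, 0)
c = step(c, "r2")    # (1, 0, 2)
print(c)
```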
3

Stephenson, Matthew James. "A Differential Datalog Interpreter." Software 2, no. 3 (2023): 427–46. http://dx.doi.org/10.3390/software2030020.

Full text
Abstract:
The core reasoning task for datalog engines is materialization, the evaluation of a datalog program over a database alongside its physical incorporation into the database itself. The de facto method of computing it is the recursive application of inference rules. Because materialization is a costly operation, it is essential for datalog engines to provide incremental materialization; that is, to adjust the computation to new data instead of restarting from scratch. One major caveat is that deleting data is notoriously more involved than adding it, since one has to take into account all possible data that has been entailed from what is being deleted. Differential dataflow is a computational model that provides efficient incremental maintenance, notably with equal performance between additions and deletions, and work distribution of iterative dataflows. In this paper, we investigate the performance of materialization with three reference datalog implementations, one of which is built on top of a lightweight relational engine, while the other two are differential-dataflow and non-differential versions of the same rewrite algorithm with the same optimizations. Experimental results suggest that monotonic aggregation is more powerful than merely ascending the powerset lattice.
APA, Harvard, Vancouver, ISO, and other styles
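The "recursive application of inference rules" mentioned in this abstract can be made concrete with a minimal sketch (ours, not any of the paper's three engines): semi-naive materialization of transitive closure, where each round joins only the facts derived in the previous round (the delta) against the base relation, so work is proportional to the new facts rather than to all facts so far.

```python
# Semi-naive materialization of transitive closure over an edge relation.

def materialize(edges):
    facts = set(edges)   # the growing materialization
    delta = set(edges)   # facts derived in the previous round
    while delta:
        # join only the delta against the base edges, drop known facts
        new = {(x, z)
               for (x, y) in delta
               for (y2, z) in edges
               if y == y2} - facts
        facts |= new
        delta = new
    return facts

print(sorted(materialize({(1, 2), (2, 3), (3, 4)})))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

The asymmetry the abstract notes is visible here: adding an edge only grows `delta`, whereas deleting one would require tracking which derived facts depended on it, which is what differential dataflow handles uniformly.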
4

Zmejev, D. N., and N. N. Levchenko. "Features of Creating Parallel Programs for the Parallel Dataflow Computing System "Buran"." Informacionnye Tehnologii 29, no. 10 (2023): 529–39. http://dx.doi.org/10.17587/it.29.529-539.

Full text
Abstract:
The use of modern multi-core and multi-processor computer systems for real-world tasks is not effective enough, even when parallel programming technologies are applied. A solution to the problem of efficiently loading computing resources is a transition to computing models and architectures that are inherently parallel. One such architecture is the Parallel Dataflow Computing System (PDCS) "Buran", which implements a dataflow computing model with a dynamically formed context. A distinctive feature of the dataflow computing model is that computations are activated by data readiness, which affects both the architecture of the computing system and the way programs are created for such systems. The differences between the imperative and dataflow programming paradigms are also reflected in the route for creating a program, especially its parallel implementation. The route for creating a dataflow parallel program differs markedly from the traditional one. Already at the first stage, a parallel algorithm is created and implemented (including the algorithm for generating the initial data). Next, this algorithm is debugged by executing it on an emulator or model of the system with one computing core. After that, the function of computation distribution is selected and configured, the program is executed (without changing its code) on an emulator or model of the system with several computing cores, and, finally, the program is debugged in multi-core mode. These stages differ from the corresponding stages of the traditional route both in form and in essence. Unlike traditional parallel, let alone sequential, programs, a dataflow program consists in equal parts of the program code that implements the task, the algorithm for generating the initial data, and the function of computation distribution.
This thesis is demonstrated by the example of finding the sum of array elements, where the implementations differ in the algorithm for generating the initial data, which radically affects how the dataflow program executes. Careful attention to each part of a dataflow program is the key to the correct and efficient solution of problems on the PDCS, which provides support to the programmer at the hardware level.
APA, Harvard, Vancouver, ISO, and other styles
5

Fischer, Andreas, Benny Fuhry, Jörn Kußmaul, Jonas Janneck, Florian Kerschbaum, and Eric Bodden. "Computation on Encrypted Data Using Dataflow Authentication." ACM Transactions on Privacy and Security 25, no. 3 (2022): 1–36. http://dx.doi.org/10.1145/3513005.

Full text
Abstract:
Encrypting data before sending it to the cloud ensures data confidentiality but requires the cloud to compute on encrypted data. Trusted execution environments, such as Intel SGX enclaves, promise to provide a secure environment in which data can be decrypted and then processed. However, vulnerabilities in the executed program give attackers ample opportunities to execute arbitrary code inside the enclave. This code can modify the dataflow of the program and leak secrets via SGX side channels. Fully homomorphic encryption would be an alternative to compute on encrypted data without data leaks. However, due to its high computational complexity, its applicability to general-purpose computing remains limited. Researchers have made several proposals for transforming programs to perform encrypted computations on less powerful encryption schemes. Yet current approaches do not support programs making control-flow decisions based on encrypted data. We introduce the concept of dataflow authentication (DFAuth) to enable such programs. DFAuth prevents an adversary from arbitrarily deviating from the dataflow of a program. Our technique hence offers protections against the side-channel attacks described previously. We implemented two flavors of DFAuth, a Java bytecode-to-bytecode compiler, and an SGX enclave running a small and program-independent trusted code base. We applied DFAuth to a neural network performing machine learning on sensitive medical data and a smart charging scheduler for electric vehicles. Our transformation yields a neural network with encrypted weights, which can be evaluated on encrypted inputs in 12.55 ms. Our protected scheduler is capable of updating the encrypted charging plan in approximately 1.06 seconds.
APA, Harvard, Vancouver, ISO, and other styles
6

Fischer, Andreas, Benny Fuhry, Florian Kerschbaum, and Eric Bodden. "Computation on Encrypted Data using Dataflow Authentication." Proceedings on Privacy Enhancing Technologies 2020, no. 1 (2020): 5–25. http://dx.doi.org/10.2478/popets-2020-0002.

Full text
Abstract:
Encrypting data before sending it to the cloud protects it against attackers, but requires the cloud to compute on encrypted data. Trusted modules, such as SGX enclaves, promise to provide a secure environment in which data can be decrypted and then processed. However, vulnerabilities in the executed program, which becomes part of the trusted code base (TCB), give attackers ample opportunity to execute arbitrary code inside the enclave. This code can modify the dataflow of the program and leak secrets via SGX side-channels. Since any larger code base is rife with vulnerabilities, it is not a good idea to outsource entire programs to SGX enclaves. A secure alternative relying solely on cryptography would be fully homomorphic encryption. However, due to its high computational complexity it is unlikely to be adopted in the near future. Researchers have made several proposals for transforming programs to perform encrypted computations on less powerful encryption schemes. Yet current approaches do not support programs making control-flow decisions based on encrypted data. We introduce the concept of dataflow authentication (DFAuth) to enable such programs. DFAuth prevents an adversary from arbitrarily deviating from the dataflow of a program. Our technique hence offers protections against the side-channel attacks described above. We implemented DFAuth using a novel authenticated homomorphic encryption scheme, a Java bytecode-to-bytecode compiler producing fully executable programs, and an SGX enclave running a small and program-independent TCB. We applied DFAuth to an existing neural network that performs machine learning on sensitive medical data. The transformation yields a neural network with encrypted weights, which can be evaluated on encrypted inputs in 0.86 s.
APA, Harvard, Vancouver, ISO, and other styles
7

Levchenko, N. N., and D. N. Zmejev. "Dynamic Control of Computation Consistency in the Parallel Dataflow Computing System." Informacionnye Tehnologii 27, no. 12 (2021): 625–33. http://dx.doi.org/10.17587/it.27.625-633.

Full text
Abstract:
When developing high-performance multiprocessor computing systems, much attention is paid to ensuring uninterrupted operation, in terms of both hardware and software. In traditional computing systems, software is the main focus in addressing these issues. The article discusses how to ensure uninterrupted operation for the parallel dataflow computing system (PDCS), which implements the dataflow computational model with a dynamically formed context. Due to the features of the PDCS, it is proposed to implement this type of control in hardware, which will increase its efficiency, since the computational process will be controlled dynamically, and not only statically.
APA, Harvard, Vancouver, ISO, and other styles
8

Bouakaz, Adnan, Pascal Fradet, and Alain Girault. "A Survey of Parametric Dataflow Models of Computation." ACM Transactions on Design Automation of Electronic Systems 22, no. 2 (2017): 1–25. http://dx.doi.org/10.1145/2999539.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Prihozhy, A. A. "Generation of shortest path search dataflow networks of actors for parallel multi-core implementation." Informatics 20, no. 2 (2023): 65–84. http://dx.doi.org/10.37661/1816-0301-2023-20-2-65-84.

Full text
Abstract:
Objectives. The problem of parallelizing computations on multicore systems is considered. Using the blocked Floyd–Warshall algorithm for shortest-path search in large dense graphs, two types of parallelism are compared: fork-join and network dataflow. Using the CAL programming language, a method of developing actors and an algorithm for generating parallel dataflow networks are proposed. The objective is to improve the performance on multicore processors of parallel implementations of algorithms that have the property of partial order of computations.
Methods. Methods of graph theory, algorithm theory, parallelization theory and formal language theory are used.
Results. Claims about the possibility of reordering calculations in the blocked Floyd–Warshall algorithm are proved, which make it possible to achieve a greater load on the cores during algorithm execution. Based on these claims, a method of constructing actors in the CAL language is developed, and an algorithm for the automatic generation of dataflow CAL networks for various configurations of the block matrices describing the lengths of the shortest paths is proposed. It is proved that the networks have the properties of rate consistency, boundedness, and liveness. In actors running in parallel, the order of execution of actions with asynchronous behavior can change dynamically, resulting in efficient use of caches and increased core load. To implement the new features of actors and networks and the method of their generation, a tunable multi-threaded CAL engine has been developed that implements a static dataflow model of computation with bounded buffer sizes. Experimental results obtained on four types of multi-core processors show that there is an optimal size of the network matrix of actors for which performance is maximal, and that this size depends on the number of cores and the size of the graph.
Conclusion. Dataflow networks of actors are shown to be an effective means of parallelizing computationally intensive algorithms that describe a partial order of computations over decomposed data. The results obtained on the blocked shortest-path algorithm show that the parallelism of dataflow networks yields higher performance of software implementations on multicore processors than the fork-join parallelism of OpenMP.
APA, Harvard, Vancouver, ISO, and other styles
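For readers unfamiliar with the blocked Floyd–Warshall scheme this abstract builds on, here is a plain sequential sketch (our own, in Python rather than CAL): in each round K the diagonal tile is updated first, then the tiles in its row and column, then the remaining tiles. The partial order among these tile updates is precisely what a dataflow network of actors can exploit for parallelism.

```python
# Sequential blocked Floyd-Warshall (illustrative sketch). D is an n x n
# distance matrix with float("inf") for missing edges; b is the tile
# size, and n must be divisible by b.

INF = float("inf")

def blocked_floyd_warshall(D, b):
    n = len(D)
    m = n // b  # number of tiles per side

    def update(ci, cj, ai, aj, bi, bj):
        # tile C(ci,cj) := min(C, min-plus product of tiles A(ai,aj), B(bi,bj))
        for k in range(b):
            for i in range(b):
                aik = D[ai * b + i][aj * b + k]
                for j in range(b):
                    d = aik + D[bi * b + k][bj * b + j]
                    if d < D[ci * b + i][cj * b + j]:
                        D[ci * b + i][cj * b + j] = d

    for K in range(m):
        update(K, K, K, K, K, K)              # 1. diagonal tile
        for J in range(m):                    # 2. tiles in row K
            if J != K:
                update(K, J, K, K, K, J)
        for I in range(m):                    # 3. tiles in column K
            if I != K:
                update(I, K, I, K, K, K)
        for I in range(m):                    # 4. all remaining tiles
            for J in range(m):
                if I != K and J != K:
                    update(I, J, I, K, K, J)

# chain 0 -> 1 -> 2 -> 3 with unit weights
D = [[0, 1, INF, INF],
     [INF, 0, 1, INF],
     [INF, INF, 0, 1],
     [INF, INF, INF, 0]]
blocked_floyd_warshall(D, 2)
print(D[0][3])  # 3
```

Within round K, the row and column tiles depend only on the diagonal tile, and each remaining tile depends only on one row tile and one column tile, so many `update` calls in steps 2-4 can run concurrently.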
10

Cheng, Wei-Kai, Xiang-Yi Liu, Hsin-Tzu Wu, Hsin-Yi Pai, and Po-Yao Chung. "Reconfigurable Architecture and Dataflow for Memory Traffic Minimization of CNNs Computation." Micromachines 12, no. 11 (2021): 1365. http://dx.doi.org/10.3390/mi12111365.

Full text
Abstract:
Computation of a convolutional neural network (CNN) requires a significant amount of memory access, which leads to substantial energy consumption. As the scale of neural networks increases, this phenomenon becomes more pronounced: the energy consumed by memory access and data migration between the on-chip buffer and off-chip DRAM can greatly exceed the computation energy on the processing element array (PE array). To reduce the energy consumption of memory access, a dataflow that maximizes data reuse and minimizes data migration between the on-chip buffer and external DRAM is important. In particular, the dimensions of the input feature map (ifmap) and the filter weights differ considerably from layer to layer of the neural network. Hardware resources may not be utilized effectively if the array architecture and dataflow cannot be reconfigured layer by layer according to the ifmap and filter dimensions, resulting in a large quantity of data migration on certain layers. However, a thorough exploration of all possible configurations is time-consuming and impractical. In this paper, we propose a quick and efficient methodology to adapt the configuration of the PE array architecture, buffer assignment, dataflow, and reuse methodology layer by layer for a given CNN architecture and hardware resources. In addition, we explore different combinations of configuration issues to investigate their effectiveness, which can serve as a guide to speed up the thorough exploration process.
APA, Harvard, Vancouver, ISO, and other styles
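The layer-by-layer reconfiguration argument can be illustrated with a back-of-the-envelope traffic model (our own illustration, not the paper's methodology): estimate the off-chip traffic of keeping the weights resident in the on-chip buffer versus keeping ifmap tiles resident, re-streaming the other operand once per tile, and pick the cheaper option per layer. The buffer size and the one-pass-per-tile model below are assumptions for illustration only.

```python
# Toy DRAM-traffic model for one conv layer (illustrative assumptions).

from math import ceil

def best_dataflow(H, W, C, K, R, buf):
    ifmap = H * W * C        # input feature-map elements
    filt = K * C * R * R     # filter-weight elements
    ofmap = H * W * K        # output elements (same padding assumed)
    # weight-stationary: weights tiled to fit the buffer, ifmap re-read per tile
    ws = filt + ceil(filt / buf) * ifmap + ofmap
    # ifmap-stationary: ifmap tiled instead, weights re-read per tile
    ifs = ifmap + ceil(ifmap / buf) * filt + ofmap
    return min([("weight-stationary", ws), ("ifmap-stationary", ifs)],
               key=lambda t: t[1])

# Early layer (large ifmap, few weights) vs. late layer (the reverse):
print(best_dataflow(112, 112, 64, 64, 3, buf=64 * 1024))   # weight-stationary wins
print(best_dataflow(7, 7, 512, 512, 3, buf=64 * 1024))     # ifmap-stationary wins
```

Even this crude model flips its answer between early and late layers, which is the intuition behind reconfiguring the architecture and dataflow per layer rather than fixing one scheme for the whole network.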
11

FERLIN, EDSON PEDRO, HEITOR SILVÉRIO LOPES, CARLOS R. ERIG LIMA, and MAURÍCIO PERRETTO. "A FPGA-BASED RECONFIGURABLE PARALLEL ARCHITECTURE FOR HIGH-PERFORMANCE NUMERICAL COMPUTATION." Journal of Circuits, Systems and Computers 20, no. 05 (2011): 849–65. http://dx.doi.org/10.1142/s0218126611007645.

Full text
Abstract:
Many real-world engineering problems require high computational power, especially regarding the processing time. Current parallel processing techniques play an important role in reducing the processing time. Recently, reconfigurable computation has gained large attention thanks to its ability to combine hardware performance and software flexibility. Also, the availability of high-density Field Programmable Gate Array devices and corresponding development systems allowed the popularization of reconfigurable computation, encouraging the development of very complex, compact, and powerful systems for custom applications. This work presents an architecture for parallel reconfigurable computation based on the dataflow concept. This architecture allows reconfigurability of the system for many problems and, particularly, for numerical computation. Several experiments were done analyzing the scalability of the architecture, as well as comparing its performance with other approaches. Overall results are relevant and promising. The developed architecture has performance and scalability suited for engineering problems that demand intensive numerical computation.
APA, Harvard, Vancouver, ISO, and other styles
12

Páli, Gábor. "Declarative scheduling of dataflow networks." Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae. Sectio computatorica, no. 37 (2012): 311–38. https://doi.org/10.71352/ac.37.311.

Full text
Abstract:
It is common for domain-specific applications to be supported by a specialized model of computation. In the domain of digital signal processing, dataflow networks are commonly employed for describing such systems, and therefore for specifying a way of execution. We have created a model using a pure functional programming language, Haskell, to capture such applications on a higher level that may be used to generate programs. However, the run-time performance of the resulting code does not yet meet the performance requirements of the field. We believe that the situation may be improved by providing tools for the application programmer to control how the program in question is scheduled and executed, without worrying too much about the low-level and error-prone details.
APA, Harvard, Vancouver, ISO, and other styles
13

Ghosal, D. S., S. K. Tripathi, L. N. Bhuyan, and H. Jiang. "Analysis of computation-communication issues in dynamic dataflow architectures." ACM SIGARCH Computer Architecture News 17, no. 3 (1989): 325–33. http://dx.doi.org/10.1145/74926.74962.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Ghosh, Sukumar, Somprakash Bandyopadhyay, and Chandan Mazumdar. "Study of a simulated stream machine for dataflow computation." Performance Evaluation 6, no. 4 (1986): 269–91. http://dx.doi.org/10.1016/0166-5316(86)90036-2.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Sterling, T. L., and J. M. Arnold. "Fine Grain Dataflow Computation without Tokens for Balanced Execution." Journal of Parallel and Distributed Computing 18, no. 3 (1993): 327–39. http://dx.doi.org/10.1006/jpdc.1993.1068.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Jagode, Heike, Anthony Danalis, and Jack Dongarra. "Accelerating NWChem Coupled Cluster through dataflow-based execution." International Journal of High Performance Computing Applications 32, no. 4 (2017): 540–51. http://dx.doi.org/10.1177/1094342016672543.

Full text
Abstract:
Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort to convert NWChem's CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines integrates seamlessly into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWChem) and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation.
APA, Harvard, Vancouver, ISO, and other styles
17

Zhou, Chijin, Bingzhou Qian, Gwihwan Go, Quan Zhang, Shanshan Li, and Yu Jiang. "PolyJuice: Detecting Mis-compilation Bugs in Tensor Compilers with Equality Saturation Based Rewriting." Proceedings of the ACM on Programming Languages 8, OOPSLA2 (2024): 1309–35. http://dx.doi.org/10.1145/3689757.

Full text
Abstract:
Tensor compilers are essential for deploying deep learning applications across various hardware platforms. While powerful, they are inherently complex and present significant challenges in ensuring correctness. This paper introduces PolyJuice, an automatic detection tool for identifying mis-compilation bugs in tensor compilers. Its basic idea is to construct semantically-equivalent computation graphs to validate the correctness of tensor compilers. The main challenge is to construct equivalent graphs capable of efficiently exploring the diverse optimization logic during compilation. We approach it from two dimensions. First, we propose arithmetic and structural equivalent rewrite rules to modify the dataflow of a tensor program. Second, we design an efficient equality saturation based rewriting framework to identify the most simplified and the most complex equivalent computation graphs for an input graph. After that, the outcome computation graphs have different dataflow and will likely experience different optimization processes during compilation. We applied it to five well-tested industrial tensor compilers, namely PyTorch Inductor, OnnxRuntime, TVM, TensorRT, and XLA, as well as two well-maintained academic tensor compilers, EinNet and Hidet. In total, PolyJuice detected 84 non-crash mis-compilation bugs, out of which 49 were confirmed with 20 fixed.
APA, Harvard, Vancouver, ISO, and other styles
18

Tsoeunyane, Lekhobola, Simon Winberg, and Michael Inggs. "Automatic Configurable Hardware Code Generation for Software-Defined Radios." Computers 7, no. 4 (2018): 53. http://dx.doi.org/10.3390/computers7040053.

Full text
Abstract:
The development of software-defined radio (SDR) systems using field-programmable gate arrays (FPGAs) compels designers to reuse pre-existing Intellectual Property (IP) cores in order to meet time-to-market and design efficiency requirements. However, the low-level development difficulties associated with FPGAs hinder productivity, even when the designer is experienced with hardware design. These low-level difficulties include non-standard interfacing methods, component communication and synchronization challenges, complicated timing constraints and processing blocks that need to be customized through time-consuming design tweaks. In this paper, we present a methodology for automated and behavioral integration of dedicated IP cores for rapid prototyping of SDR applications. To maintain high performance of the SDR designs, our methodology integrates IP cores using characteristics of the dataflow model of computation (MoC), namely the static dataflow with access patterns (SDF-AP). We show how the dataflow is mapped onto the low-level model of hardware by efficiently applying low-level based optimizations and using a formal analysis technique that guarantees the correctness of the generated solutions. Furthermore, we demonstrate the capability of our automated hardware design approach by developing eight SDR applications in VHDL. The results show that well-optimized designs are generated and that this can improve productivity while also conserving the hardware resources used.
APA, Harvard, Vancouver, ISO, and other styles
19

Alam, Shahanur, Chris Yakopcic, Qing Wu, Mark Barnell, Simon Khan, and Tarek M. Taha. "Survey of Deep Learning Accelerators for Edge and Emerging Computing." Electronics 13, no. 15 (2024): 2988. http://dx.doi.org/10.3390/electronics13152988.

Full text
Abstract:
The unprecedented progress in artificial intelligence (AI), particularly in deep learning algorithms with ubiquitous internet connected smart devices, has created a high demand for AI computing on the edge devices. This review studied commercially available edge processors, and the processors that are still in industrial research stages. We categorized state-of-the-art edge processors based on the underlying architecture, such as dataflow, neuromorphic, and processing in-memory (PIM) architecture. The processors are analyzed based on their performance, chip area, energy efficiency, and application domains. The supported programming frameworks, model compression, data precision, and the CMOS fabrication process technology are discussed. Currently, most commercial edge processors utilize dataflow architectures. However, emerging non-von Neumann computing architectures have attracted the attention of the industry in recent years. Neuromorphic processors are highly efficient for performing computation with fewer synaptic operations, and several neuromorphic processors offer online training for secured and personalized AI applications. This review found that the PIM processors show significant energy efficiency and consume less power compared to dataflow and neuromorphic processors. A future direction of the industry could be to implement state-of-the-art deep learning algorithms in emerging non-von Neumann computing paradigms for low-power computing on edge devices.
APA, Harvard, Vancouver, ISO, and other styles
20

Cheng, Xiaoshu, Yiwen Wang, Weiran Ding, Hongfei Lou, and Ping Li. "Leveraging Bit-Serial Architectures for Hardware-Oriented Deep Learning Accelerators with Column-Buffering Dataflow." Electronics 13, no. 7 (2024): 1217. http://dx.doi.org/10.3390/electronics13071217.

Full text
Abstract:
Bit-serial neural network accelerators address the growing need for compact and energy-efficient deep learning tools. Traditional neural network accelerators, while effective, often grapple with issues of size, power consumption, and versatility in handling a variety of computational tasks. To counter these challenges, this paper introduces an approach that hinges on the integration of bit-serial processing with advanced dataflow techniques and architectural optimizations. Central to this approach is a column-buffering (CB) dataflow, which significantly reduces access and movement requirements for the input feature map (IFM), thereby enhancing efficiency. Moreover, a simplified quantization process effectively eliminates biases, streamlining the overall computation process. Furthermore, this paper presents a meticulously designed LeNet-5 accelerator leveraging a convolutional layer processing element array (CL PEA) architecture incorporating an improved bit-serial multiply–accumulate unit (MAC). Empirically, our work demonstrates superior performance in terms of frequency, chip area, and power consumption compared to current state-of-the-art ASIC designs. Specifically, our design utilizes fewer hardware resources to implement a complete accelerator, achieving a high performance of 7.87 GOPS on a Xilinx Kintex-7 FPGA with a brief processing time of 284.13 μs. The results affirm that our design is exceptionally suited for applications requiring compact, low-power, and real-time solutions.
APA, Harvard, Vancouver, ISO, and other styles
21

Ingersoll, Segreen, and Sotirios G. Ziavras. "Dataflow computation with intelligent memories emulated on field-programmable gate arrays (FPGAs)." Microprocessors and Microsystems 26, no. 6 (2002): 263–80. http://dx.doi.org/10.1016/s0141-9331(02)00038-8.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Ammar, Khaled, Siddhartha Sahu, Semih Salihoglu, and M. Tamer Özsu. "Optimizing differentially-maintained recursive queries on dynamic graphs." Proceedings of the VLDB Endowment 15, no. 11 (2022): 3186–98. http://dx.doi.org/10.14778/3551793.3551862.

Full text
Abstract:
Differential computation (DC) is a highly general incremental computation/view maintenance technique that can maintain the output of an arbitrary and possibly recursive dataflow computation upon changes to its base inputs. As such, it is a promising technique for graph database management systems (GDBMS) that support continuous recursive queries over dynamic graphs. Although differential computation can be highly efficient for maintaining these queries, it can require a prohibitively large amount of memory. This paper studies how to reduce the memory overhead of DC with the goal of increasing the scalability of systems that adopt it. We propose a suite of optimizations based on dropping the differences of operators, either completely or partially, and recomputing these differences when necessary. We propose deterministic and probabilistic data structures to keep track of the dropped differences. Extensive experiments demonstrate that the optimizations can improve the scalability of a DC-based continuous query processor.
APA, Harvard, Vancouver, ISO, and other styles
23

Laddad, Shadaj, Alvin Cheung, Joseph M. Hellerstein, and Mae Milano. "Flo: A Semantic Foundation for Progressive Stream Processing." Proceedings of the ACM on Programming Languages 9, POPL (2025): 241–70. https://doi.org/10.1145/3704845.

Full text
Abstract:
Streaming systems are present throughout modern applications, processing continuous data in real-time. Existing streaming languages have a variety of semantic models and guarantees that are often incompatible. Yet all these languages are considered "streaming"---what do they have in common? In this paper, we identify two general yet precise semantic properties: streaming progress and eager execution. Together, they ensure that streaming outputs are deterministic and kept fresh with respect to streaming inputs. We formally define these properties in the context of Flo, a parameterized streaming language that abstracts over dataflow operators and the underlying structure of streams. It leverages a lightweight type system to distinguish bounded streams, which allow operators to block on termination, from unbounded ones. Furthermore, Flo provides constructs for dataflow composition and nested graphs with cycles. To demonstrate the generality of our properties, we show how key ideas from representative streaming and incremental computation systems---Flink, LVars, and DBSP---have semantics that can be modeled in Flo and guarantees that map to our properties.
APA, Harvard, Vancouver, ISO, and other styles
24

Li, Baoting, Hang Wang, Xuchong Zhang, et al. "Dynamic Dataflow Scheduling and Computation Mapping Techniques for Efficient Depthwise Separable Convolution Acceleration." IEEE Transactions on Circuits and Systems I: Regular Papers 68, no. 8 (2021): 3279–92. http://dx.doi.org/10.1109/tcsi.2021.3078541.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Gao, Guang R. "Maximum Pipelining Of Array Computation: A Pipelined Code Mapping Scheme For Dataflow Computers." INFOR: Information Systems and Operational Research 27, no. 2 (1989): 145–72. http://dx.doi.org/10.1080/03155986.1989.11732089.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Yang, Zhao Hong, Qing Xiao, Yun Zhan Gong, Da Hai Jin, and Ya Wen Wang. "The Research of an Abstract Semantic Framework for Defect Detecting." Advanced Materials Research 186 (January 2011): 536–40. http://dx.doi.org/10.4028/www.scientific.net/amr.186.536.

Full text
Abstract:
This paper proposes a non-relational abstract semantic framework. It uses interval set to represent the value of numerical variables and complete lattice to represent Boolean variables and reference variables. It presents the abstract computation method of basic expressions and the nodes of control flow graph. It uses function summaries to represent the context information of function call needed by defects detecting. Based on the results of abstract computation, it uses extended state machine to define defect patterns and proposes a path-sensitive method based on dataflow analysis to detect defects. It avoids the combination explosion of full path analysis by merging the conditions of identical property state at join points in the CFG. Practical test results show that the proposed methods have features of high efficiency, low false positive and low false negative.
APA, Harvard, Vancouver, ISO, and other styles
27

Liu, Yang, Yiheng Zhang, Xiaoran Hao, et al. "Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering." Electronics 13, no. 5 (2024): 975. http://dx.doi.org/10.3390/electronics13050975.

Full text
Abstract:
Convolutional neural networks have been widely applied in the field of computer vision. In convolutional neural networks, convolution operations account for more than 90% of the total computational workload. The current mainstream approach to achieving highly energy-efficient convolution operations is through dedicated hardware accelerators. Convolution operations involve a significant amount of weight and input feature data. Due to limited on-chip cache space in accelerators, a significant amount of off-chip DRAM memory access is involved in the computation process. The latency of DRAM access is 20 times higher than that of SRAM, and the energy consumption of DRAM access is 100 times higher than that of multiply–accumulate (MAC) units. It is evident that the “memory wall” and “power wall” issues in neural network computation remain challenging. This paper presents the design of a hardware accelerator for convolutional neural networks. It employs a dataflow optimization strategy based on on-chip data reordering. This strategy improves on-chip data utilization and reduces the frequency of data exchanges between the on-chip cache and off-chip DRAM. The experimental results indicate that, compared to the accelerator without this strategy, it can reduce data exchange frequency by up to 82.9%.
APA, Harvard, Vancouver, ISO, and other styles
28

Kavi, Krishna M., and Akshay K. Deshpande. "Specification of concurrent processes using a dataflow model of computation and partially ordered events." Journal of Systems and Software 16, no. 2 (1991): 107–20. http://dx.doi.org/10.1016/0164-1212(91)90004-p.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Guliyev, Rustam, Aparajita Haldar, and Hakan Ferhatosmanoglu. "D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks." Proceedings of the VLDB Endowment 17, no. 11 (2024): 2764–77. http://dx.doi.org/10.14778/3681954.3681961.

Full text
Abstract:
Graph Neural Network (GNN) models on streaming graphs entail algorithmic challenges to continuously capture their dynamic state, as well as systems challenges to optimize latency, memory, and throughput during both inference and training. We present D3-GNN, the first distributed, hybrid-parallel, streaming GNN system designed to handle real-time graph updates under an online query setting. Our system addresses data management, algorithmic, and systems challenges, enabling continuous capturing of the dynamic state of the graph and updating node representations with fault-tolerance and optimal latency, load-balance, and throughput. D3-GNN utilizes streaming GNN aggregators and an unrolled, distributed computation graph architecture to handle cascading graph updates. To counteract data skew and neighborhood explosion issues, we introduce inter-layer and intra-layer windowed forward pass solutions. Experiments on large-scale graph streams demonstrate that D3-GNN achieves high efficiency and scalability. Compared to DGL, D3-GNN achieves a significant throughput improvement of about 76x for streaming workloads. The windowed enhancement further reduces running times by around 10x and message volumes by up to 15x at higher parallelism.
APA, Harvard, Vancouver, ISO, and other styles
30

Bloch, Aurelien, Simone Casale-Brunet, and Marco Mattavelli. "Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms by Dynamic Network Execution." Journal of Low Power Electronics and Applications 12, no. 3 (2022): 36. http://dx.doi.org/10.3390/jlpea12030036.

Full text
Abstract:
The performance of programs executed on heterogeneous parallel platforms largely depends on the design choices regarding how to partition the processing on the various different processing units. In other words, it depends on the assumptions and parameters that define the partitioning, mapping, scheduling, and allocation of data exchanges among the various processing elements of the platform executing the program. The advantage of programs written in languages using the dataflow model of computation (MoC) is that executing the program with different configurations and parameter settings does not require rewriting the application software for each configuration setting, but only requires generating a new synthesis of the execution code corresponding to different parameters. The synthesis stage of dataflow programs is usually supported by automatic code generation tools. Another competitive advantage of dataflow software methodologies is that they are well-suited to support designs on heterogeneous parallel systems as they are inherently free of memory access contention issues and naturally expose the available intrinsic parallelism. So as to fully exploit these advantages and to be able to efficiently search the configuration space to find the design points that better satisfy the desired design constraints, it is necessary to develop tools and associated methodologies capable of evaluating the performance of different configurations and to drive the search for good design configurations, according to the desired performance criteria. The number of possible design assumptions and associated parameter settings is usually so large (i.e., the dimensions and size of the design space) that intuition as well as trial and error are clearly unfeasible, inefficient approaches. This paper describes a method for the clock-accurate profiling of software applications developed using the dataflow programming paradigm, such as the formal RVC-CAL language.
The profiling can be applied when the application program has been compiled and executed on GPU/CPU heterogeneous hardware platforms utilizing two main methodologies, denoted as static and dynamic. This paper also describes how a method for the qualitative evaluation of the performance of such programs as a function of the supplied configuration parameters can be successfully applied to heterogeneous platforms. The technique was illustrated using two different application software examples and several design points.
APA, Harvard, Vancouver, ISO, and other styles
31

Valpreda, Emanuele, Pierpaolo Morì, Nael Fasfous, et al. "HW-Flow-Fusion: Inter-Layer Scheduling for Convolutional Neural Network Accelerators with Dataflow Architectures." Electronics 11, no. 18 (2022): 2933. http://dx.doi.org/10.3390/electronics11182933.

Full text
Abstract:
Energy- and throughput-efficient acceleration of convolutional neural networks (CNN) on devices with a strict power budget is achieved by leveraging different scheduling techniques to minimize data movement and maximize data reuse. Several dataflow mapping frameworks have been developed to explore the optimal scheduling of CNN layers on reconfigurable accelerators. However, previous works usually optimize each layer individually, without leveraging the data reuse between the layers of CNNs. In this work, we present an analytical model to achieve efficient data reuse by searching for efficient scheduling of communication and computation across layers. We call this inter-layer scheduling framework HW-Flow-Fusion, as we explore the fused map-space of multiple layers sharing the available resources of the same accelerator, investigating the constraints and trade-offs of mapping the execution of multiple workloads with data dependencies. We propose a memory-efficient data reuse model, tiling, and resource partitioning strategies to fuse multiple layers without recomputation. Compared to standard single-layer scheduling, inter-layer scheduling can reduce the communication volume by 51% and 53% for selected VGG16-E and ResNet18 layers on a spatial array accelerator, and reduce the latency by 39% and 34%, respectively, while also increasing the computation-to-communication ratio, which improves memory bandwidth efficiency.
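Fusing layers without recomputation hinges on receptive-field arithmetic: the input tile must cover the halo of every fused layer. A minimal sketch of that calculation, with hypothetical parameters rather than the paper's model:

```python
# Illustrative receptive-field arithmetic behind layer fusion: the input
# tile size needed so that a chain of fused conv layers produces a given
# output tile without any recomputation. Parameters are hypothetical.

def input_tile(out_tile, layers):
    """layers: list of (kernel, stride), ordered first fused layer to last."""
    size = out_tile
    for k, s in reversed(layers):
        size = (size - 1) * s + k   # standard conv shape inversion
    return size

# Two 3x3, stride-1 layers: an 8x8 output tile needs a 12x12 input tile,
# i.e. the halo grows by (k - 1) per fused layer.
print(input_tile(8, [(3, 1), (3, 1)]))  # 12
```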
APA, Harvard, Vancouver, ISO, and other styles
32

Lankala, Srinija, and Dr M. Ramana Reddy. "Design and Implementation of Energy-Efficient Floating Point MFCC Extraction Architecture for Speech Recognition Systems." International Journal for Research in Applied Science and Engineering Technology 10, no. 9 (2022): 1217–25. http://dx.doi.org/10.22214/ijraset.2022.46807.

Full text
Abstract:
This brief presents an energy-efficient architecture to extract mel-frequency cepstrum coefficients (MFCCs) for real-time speech recognition systems. Based on the algorithmic property of MFCC feature extraction, the architecture is designed with floating-point arithmetic units to cover a wide dynamic range with a small bit-width. Moreover, various operations required in the MFCC extraction are examined to optimize operational bit-width and lookup tables needed to compute nonlinear functions, such as trigonometric and logarithmic functions. In addition, the dataflow of MFCC extraction is tailored to minimize the computation time. As a result, the energy consumption is considerably reduced compared with previous MFCC extraction systems.
APA, Harvard, Vancouver, ISO, and other styles
33

Liu, Peng, and Yu Wang. "A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow." Micromachines 16, no. 1 (2025): 101. https://doi.org/10.3390/mi16010101.

Full text
Abstract:
General matrix multiplication (GEMM) in machine learning involves massive computation and data movement, which restricts its deployment on resource-constrained devices. Although data reuse can reduce data movement during GEMM processing, current approaches fail to fully exploit its potential. This work introduces a sparse GEMM accelerator with a weight-and-output stationary (WOS) dataflow and a distributed buffer architecture. It processes GEMM in a compressed format and eliminates on-chip transfers of both weights and partial sums. Furthermore, to map the compressed GEMM of various sizes onto the accelerator, an adaptable mapping scheme is designed. However, the irregular sparsity of weight matrices makes it difficult to store them in local buffers with the compressed format; denser vectors can exceed the buffer capacity, while sparser vectors may lead to the underutilization of buffers. To address this complication, this work also proposes an offline sparsity-aware shuffle strategy for weights, which balances the utilization of distributed buffers and minimizes buffer waste. Finally, a low-cost sparse computing method is applied to the WOS dataflow with globally shared inputs to achieve high computing throughput. Experiments with an FPGA show that the proposed accelerator achieves 1.73× better computing efficiency and 1.36× higher energy efficiency than existing approaches.
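As a rough software illustration of processing GEMM in a compressed format, the following is a plain CSR sparse-times-dense multiply; the accelerator's WOS dataflow, distributed buffers, and shuffle strategy are not modeled:

```python
# A minimal CSR-style sparse multiply: only the nonzero weights are stored
# and touched, which is the essence of computing on a compressed format.

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR (values, column indices, row pointers)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for i in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[i] * x[col_idx[i]]
    return y

# A = [[1, 0, 2],
#      [0, 0, 3]]  stored as three parallel arrays:
values, col_idx, row_ptr = [1.0, 2.0, 3.0], [0, 2, 2], [0, 2, 3]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The imbalance the abstract mentions is visible here: rows with many nonzeros occupy long `values` runs while sparser rows occupy short ones, which is what a sparsity-aware shuffle of the weight matrix would even out across distributed buffers.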
APA, Harvard, Vancouver, ISO, and other styles
34

Konečný, Michal, and Amin Farjudian. "Compositional Semantics of Dataflow Networks with Query-Driven Communication of Exact Values." JUCS - Journal of Universal Computer Science 16, no. (18) (2010): 2629–56. https://doi.org/10.3217/jucs-016-18-2629.

Full text
Abstract:
We develop and study the concept of dataflow process networks as used for example by Kahn to suit exact computation over data types related to real numbers, such as continuous functions and geometrical solids. Furthermore, we consider communicating these exact objects among processes using protocols of a query-answer nature as introduced in our earlier work. This enables processes to provide valid approximations with certain accuracy and focusing on certain locality as demanded by the receiving processes through queries. We define domain-theoretical denotational semantics of our networks in two ways: (1) directly, i.e. by viewing the whole network as a composite process and applying the process semantics introduced in our earlier work; and (2) compositionally, i.e. by a fixed-point construction similar to that used by Kahn from the denotational semantics of individual processes in the network. The direct semantics closely corresponds to the operational semantics of the network (i.e. it is correct) but very difficult to study for concrete networks. The compositional semantics enables compositional analysis of concrete networks, assuming it is correct. We prove that the compositional semantics is a safe approximation of the direct semantics. We also provide a method that can be used in many cases to establish that the two semantics fully coincide, i.e. safety is not achieved through inactivity or meaningless answers. The results are extended to cover recursively-defined infinite networks as well as nested finite networks. A robust prototype implementation of our model is available.
APA, Harvard, Vancouver, ISO, and other styles
35

Summers, Sioni, and Andrew Rose. "Kalman Filter track reconstruction on FPGAs for acceleration of the High Level Trigger of the CMS experiment at the HL-LHC." EPJ Web of Conferences 214 (2019): 01003. http://dx.doi.org/10.1051/epjconf/201921401003.

Full text
Abstract:
Track reconstruction at the CMS experiment uses the Combinatorial Kalman Filter. The algorithm computation time scales exponentially with pileup, which will pose a problem for the High Level Trigger at the High Luminosity LHC. FPGAs, which are already used extensively in hardware triggers, are becoming more widely used for compute acceleration. With a combination of high performance, energy efficiency, and predictable and low latency, FPGA accelerators are an interesting technology for high energy physics. Here, progress towards porting of the CMS track reconstruction to Maxeler Technologies’ Dataflow Engines is shown, programmed with their high level language MaxJ. The performance is compared to CPUs, and further steps to optimise for the architecture are presented.
APA, Harvard, Vancouver, ISO, and other styles
36

LIU, HAI, ERIC CHENG, and PAUL HUDAK. "Causal commutative arrows." Journal of Functional Programming 21, no. 4-5 (2011): 467–96. http://dx.doi.org/10.1017/s0956796811000153.

Full text
Abstract:
Arrows are a popular form of abstract computation. Being more general than monads, they are more broadly applicable, and, in particular, are a good abstraction for signal processing and dataflow computations. Most notably, arrows form the basis for a domain-specific language called Yampa, which has been used in a variety of concrete applications, including animation, robotics, sound synthesis, control systems, and graphical user interfaces. Our primary interest is in better understanding the class of abstract computations captured by Yampa. Unfortunately, arrows are not concrete enough to do this with precision. To remedy this situation, we introduce the concept of commutative arrows that capture a noninterference property of concurrent computations. We also add an init operator that captures the causal nature of arrow effects, and identify its associated law. To study this class of computations in more detail, we define an extension to arrows called causal commutative arrows (CCA), and study its properties. Our key contribution is the identification of a normal form for CCA called causal commutative normal form (CCNF). By defining a normalization procedure, we have developed an optimization strategy that yields dramatic improvements in performance over conventional implementations of arrows. We have implemented this technique in Haskell, and conducted benchmarks that validate the effectiveness of our approach. When compiled with the Glasgow Haskell Compiler (GHC), the overall methodology can result in significant speedups.
APA, Harvard, Vancouver, ISO, and other styles
37

Huang, Shanshi, Xiaoyu Sun, Xiaochen Peng, Hongwu Jiang, and Shimeng Yu. "Achieving High In Situ Training Accuracy and Energy Efficiency with Analog Non-Volatile Synaptic Devices." ACM Transactions on Design Automation of Electronic Systems 27, no. 4 (2022): 1–19. http://dx.doi.org/10.1145/3500929.

Full text
Abstract:
On-device embedded artificial intelligence prefers the adaptive learning capability when deployed in the field, and thus in situ training is required. The compute-in-memory approach, which exploits the analog computation within the memory array, is a promising solution for deep neural network (DNN) on-chip acceleration. Emerging non-volatile memories are of great interest, serving as analog synapses due to their multilevel programmability. However, the asymmetry and nonlinearity in the conductance tuning remain grand challenges for achieving high in situ training accuracy. In addition, analog-to-digital converters at the edge of the memory array introduce quantization errors. In this work, we present an algorithm-hardware co-optimization to overcome these challenges. We incorporate the device/circuit non-ideal effects into the DNN propagation and weight update steps. By introducing the adaptive “momentum” in the weight update rule, in situ training accuracy on CIFAR-10 could approach its software baseline even under severe asymmetry/nonlinearity and analog-to-digital converter quantization error. The hardware performance of the on-chip training architecture and the overhead for adding “momentum” are also evaluated. By optimizing the backpropagation dataflow, 23.59 TOPS/W training energy efficiency (12× improvement compared to naïve dataflow) is achieved. The circuits that handle “momentum” introduce only 4.2% energy overhead. Our results show great potential and more relaxed requirements that enable emerging non-volatile memories for DNN acceleration on the embedded artificial intelligence platforms.
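For orientation, the classical momentum rule that the adaptive scheme builds on can be written as follows; this is a generic sketch, not the paper's device-aware update:

```python
# Standard SGD-with-momentum step: the velocity accumulates a decaying sum
# of past gradients, smoothing updates. The paper adapts this idea to
# counter asymmetric/nonlinear conductance tuning; that adaptation is not
# reproduced here.

def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    v = beta * velocity + grad   # update the momentum term
    return w - lr * v, v         # apply the smoothed gradient

w, v = 1.0, 0.0
w, v = momentum_step(w, grad=0.5, velocity=v)
print(round(w, 3), v)  # 0.95 0.5
```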
APA, Harvard, Vancouver, ISO, and other styles
38

Wu, Yao, Long Zheng, Brian Heilig, and Guang R. Gao. "HAMR: A dataflow-based real-time in-memory cluster computing engine." International Journal of High Performance Computing Applications 31, no. 5 (2016): 361–74. http://dx.doi.org/10.1177/1094342016672080.

Full text
Abstract:
As the attention given to big data grows, cluster computing systems for distributed processing of large data sets have become the mainstream and a critical requirement in high performance distributed system research. One of the most successful systems is Hadoop, which uses MapReduce as a programming/execution model and uses disks as intermediate storage to process huge volumes of data. Spark, as an in-memory computing engine, can solve iterative and interactive problems more efficiently. However, it is currently a consensus that these are not the final solutions to big data, due to the MapReduce-like programming model, the synchronous execution model, the constraint that they only support batch processing, and so on. A new solution, in particular a fundamental evolution, is needed to bring big data solutions into a new era. In this paper, we introduce a new cluster computing system called HAMR which supports both batch and streaming processing. To achieve better performance, HAMR integrates high performance computing approaches, i.e. dataflow fundamentals, into a big data solution. More specifically, HAMR is fully designed around in-memory computing to reduce unnecessary disk access overhead; task scheduling and memory management are performed in a fine-grained manner to expose more parallelism; and asynchronous execution improves the efficiency of computation resource usage as well as workload balance across the whole cluster. The experimental results show that HAMR can outperform Hadoop MapReduce and Spark by up to 19x and 7x respectively, in the same cluster environment. Furthermore, HAMR can handle scaling data sizes well beyond the capabilities of Spark.
APA, Harvard, Vancouver, ISO, and other styles
39

Sinha, Amitabha, and Dhruba Basu. "A RECONFIGURABLE DATA FLOW ARCHITECTURE FOR SIGNAL PROCESSING APPLICATIONS." SYNCHROINFO JOURNAL 7, no. 5 (2021): 2–6. http://dx.doi.org/10.36724/2664-066x-2021-7-5-2-6.

Full text
Abstract:
This paper devises a dataflow model of computation for signal processing applications in which the operational nodes are signal/image processing functions such as Pixsum, Edge, and Smooth. These functions are configured at run time from a pool of reconfigurable FPGAs. Because of the dataflow model of computation, the signal processing functions execute concurrently; at the same time, by exploiting their inherent spatial parallelism, these functions execute at high speed. There is a twofold speed-up in the execution of image/signal processing applications: one at the architecture level, wherein a node of the dataflow model executes a digital signal processing (DSP) function rather than a low-level machine operation; the second is due to the fact that each DSP function is configured to execute in an FPGA by maximally using the concurrent operations that such a function permits. Another significant benefit of our proposed architecture is that by reconfiguring an FPGA for a DSP function at run time, the reusability of the hardware elements results in reduced cost of operations. In this paper we provide an outline of the dataflow architecture and its operational aspects.
APA, Harvard, Vancouver, ISO, and other styles
40

Wang, Jin Lin. "Scheduling of Periodic Tasks with Data Dependency on Multiprocessors." Advanced Materials Research 756-759 (September 2013): 2131–36. http://dx.doi.org/10.4028/www.scientific.net/amr.756-759.2131.

Full text
Abstract:
This article studies the scheduling problem for a set of tasks with time or data constraints on a number of identical, fully connected processors. We present an algorithm that obtains a set of static schedule lists, one for each processor, such that each task starts executing after its release time and completes its computation before its deadline, and all precedence relations between tasks resulting from data dependency are satisfied. The data dependency relations between tasks are represented by synchronous dataflow graphs (SDF), as these can express task concurrency and enable effective scheduling on multiprocessor platforms. The SDF model, however, does not support the time constraints of tasks directly, so an adaptation is applied to conform to the time limits. With this adaptation, periodic tasks with implicit or constrained deadlines can be scheduled effectively on a multiprocessor platform.
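The consistency condition behind SDF scheduling can be illustrated with a short sketch that solves the balance equations r[src]·produced = r[dst]·consumed for the repetition vector; this is illustrative only and does not model the paper's deadline handling:

```python
from fractions import Fraction
from math import lcm

# For each SDF edge (src, dst, produced, consumed), a consistent graph
# satisfies r[src] * produced == r[dst] * consumed. The smallest positive
# integer solution r is the repetition vector, which fixes how many times
# each actor fires per period of a static schedule.

def repetition_vector(actors, edges):
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate rates along edges (assumes a connected graph)
        changed = False
        for src, dst, p, c in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * p / c
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * c / p
                changed = True
    denom = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * denom) for a, r in rate.items()}

# A produces 2 tokens per firing, B consumes 3: fire A 3 times per 2 of B.
print(repetition_vector(["A", "B"], [("A", "B", 2, 3)]))  # {'A': 3, 'B': 2}
```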
APA, Harvard, Vancouver, ISO, and other styles
41

Uustalu, Tarmo, and Tarmo Vene. "Signals and Comonads." JUCS - Journal of Universal Computer Science 11, no. (7) (2005): 1310–26. https://doi.org/10.3217/jucs-011-07-1311.

Full text
Abstract:
We propose a novel discipline for programming stream functions and for the semantic description of stream manipulation languages, based on the observation that both general and causal stream functions can be characterized as coKleisli arrows of comonads. This seems to be a promising application of the old, but very little exploited, idea that if monads abstract notions of computation of a value, comonads ought to be usable as an abstraction of notions of a value in a context. We also show that causal partial-stream functions can be described in terms of a combination of a comonad and a monad.
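The coKleisli view can be approximated in Python by taking the "value in a context" to be the nonempty prefix of the stream observed so far; a causal stream function is then an ordinary function from histories to values (an illustrative sketch, not the paper's formal development):

```python
# Causal stream functions as coKleisli-style arrows for a "nonempty prefix"
# comonad: each output depends only on the input values seen so far.

def run_causal(f, stream):
    """Lift a history -> value function to a stream -> stream function."""
    return [f(stream[: i + 1]) for i in range(len(stream))]

# Counit ("extract"): the current value is the last element of the history.
identity = lambda history: history[-1]
# A genuinely causal function: the running sum of the input so far.
running_sum = lambda history: sum(history)

print(run_causal(identity, [1, 2, 3]))     # [1, 2, 3]
print(run_causal(running_sum, [1, 2, 3]))  # [1, 3, 6]
```

A general (non-causal) stream function would instead get the whole stream as its context; restricting the context to prefixes is exactly what makes the function causal.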
APA, Harvard, Vancouver, ISO, and other styles
42

Sabogal, Sebastian, Alan George, and Gary Crum. "Reconfigurable Framework for Resilient Semantic Segmentation for Space Applications." ACM Transactions on Reconfigurable Technology and Systems 14, no. 4 (2021): 1–32. http://dx.doi.org/10.1145/3472770.

Full text
Abstract:
Deep learning (DL) presents new opportunities for enabling spacecraft autonomy, onboard analysis, and intelligent applications for space missions. However, DL applications are computationally intensive and often infeasible to deploy on radiation-hardened (rad-hard) processors, which traditionally harness a fraction of the computational capability of their commercial-off-the-shelf counterparts. Commercial FPGAs and system-on-chips present numerous architectural advantages and provide the computation capabilities to enable onboard DL applications; however, these devices are highly susceptible to radiation-induced single-event effects (SEEs) that can degrade the dependability of DL applications. In this article, we propose Reconfigurable ConvNet (RECON), a reconfigurable acceleration framework for dependable, high-performance semantic segmentation for space applications. In RECON, we propose both selective and adaptive approaches to enable efficient SEE mitigation. In our selective approach, control-flow parts are selectively protected by triple-modular redundancy to minimize SEE-induced hangs, and in our adaptive approach, partial reconfiguration is used to adapt the mitigation of dataflow parts in response to a dynamic radiation environment. Combined, both approaches enable RECON to maximize system performability subject to mission availability constraints. We perform fault injection and neutron irradiation to observe the susceptibility of RECON and use dependability modeling to evaluate RECON in various orbital case studies to demonstrate a 1.5–3.0× performability improvement in both performance and energy efficiency compared to static approaches.
APA, Harvard, Vancouver, ISO, and other styles
43

THORNTON, MITCHELL. "Performance Evaluation of a Parallel Decoupled Data Driven Multiprocessor." Parallel Processing Letters 13, no. 03 (2003): 497–507. http://dx.doi.org/10.1142/s0129626403001458.

Full text
Abstract:
The Decoupled Data-Driven (D3) architecture has shown promising results from performance evaluations based upon deterministic simulations. This paper provides performance evaluations of the D3 architecture through the formulation and analysis of a stochastic model. The D3 architecture is a hybrid control/dataflow approach that takes advantage of inherent parallelism present in a program by dynamically scheduling program threads based on data availability and it also takes advantage of locality through the use of conventional processing elements that execute the program threads. The model is validated by comparing the deterministic and stochastic model responses. After model validation, various input parameters are varied such as the number of available processing elements and average threadlength, then the performance of the architecture is evaluated. The stochastic model is based upon a closed queueing network and utilizes the concepts of available parallelism and virtual queues in order to be reduced to a Markovian system. Experiments with varying computation engine threadlengths and communication latencies indicate a high degree of tolerance with respect to exploited parallelism.
APA, Harvard, Vancouver, ISO, and other styles
44

Zhang, Chen, Xin’an Wang, Shanshan Yong, Yining Zhang, Qiuping Li, and Chenyang Wang. "An Energy-Efficient Convolutional Neural Network Processor Architecture Based on a Systolic Array." Applied Sciences 12, no. 24 (2022): 12633. http://dx.doi.org/10.3390/app122412633.

Full text
Abstract:
Deep convolutional neural networks (CNNs) have shown strong abilities in the application of artificial intelligence. However, due to their extensive amount of computation, traditional processors have low energy efficiency when executing CNN algorithms, which is unacceptable for portable devices with limited hardware cost and battery capacity, so designing a CNN-specific processor is necessary. In this paper, we propose an energy-efficient CNN processor architecture for lightweight devices with an array of 384 processing elements (PEs). Using the systolic-array-based PE array, it realizes parallel operations between filter rows and between channels of output feature maps, supporting the acceleration of 3D convolution and fully connected computation with various parameters by configuring internal instruction registers. The computing strategy based on the proposed systolic dataflow achieves lower hardware overhead than other strategies, and the reuse of image values and weight values effectively reduces the power of memory access. A memory system with a multi-level storage structure combining register files (RF) and SRAM is used in the proposed CNN processor, which further reduces the energy overhead of computing. The proposed CNN processor architecture has been verified on a ZC706 FPGA platform using VGG-16 with the proposed image segmentation method. The evaluation results indicate that the peak throughput reaches 115.2 GOP/s while consuming 3.801 W at 150 MHz; energy efficiency and DSP efficiency reach 30.32 GOP/s/W and 0.26 GOP/s/DSP, respectively.
APA, Harvard, Vancouver, ISO, and other styles
45

Shivarudraiah, Sumalatha, and Rajeswari Rajeswari. "Serial parallel dataflow-pipelined processing architecture based accelerator for 2D transform-quantization in video coder and decoder." IAES International Journal of Artificial Intelligence (IJ-AI) 14, no. 1 (2025): 798. http://dx.doi.org/10.11591/ijai.v14.i1.pp798-809.

Full text
Abstract:
The video coder and decoder (CODEC) standards from MPEG-4 to the recent versatile video codec (VVC) adopted lossy compression methodologies, which involve transformation, quantization and entropy coding. The growing usage of video data in all means of communication demands more bandwidth and storage. In compression with redundancy removal by transform coefficient coding, the focal point is the crucial sequential dataflow and the data processing structures. Handling the block-wise data close to the processing unit before and after computation reduces the data waiting time of the processing unit, hence accelerating the targeted functionality. The proposed serial parallel dataflow-pipelined processing architecture (SPDPA) accelerates the processing unit through on-chip data availability and parallel data access options, as well as pipelined operations of transformation, data transpose and quantization. Post-implementation results of the architecture targeted to 16 nm and 28 nm field programmable gate arrays (FPGAs) show that there is a trade-off between power and operating frequency for various block sizes. The design targeted to 16 nm works at higher frequencies with an average power consumption of 0.64 W, compared to the 28 nm FPGA, which consumes a lower average power of 0.15 W.
APA, Harvard, Vancouver, ISO, and other styles
47

Xie, Guangwei, Xitian Fan, Zhongchen Huang, Wei Cao, and Fan Zhang. "PassRecover: A Multi-FPGA System for End-to-End Offline Password Recovery Acceleration." Electronics 14, no. 7 (2025): 1415. https://doi.org/10.3390/electronics14071415.

Full text
Abstract:
In the domain of password recovery, deep learning has emerged as a pivotal technology for enhancing recovery efficiency. Despite its effectiveness, the inherent computational complexity of deep learning-based password generation algorithms poses substantial challenges, particularly in achieving synergistic acceleration between deep learning inference and the plaintext encryption process. In this paper, we introduce PassRecover, a multi-FPGA-based computing system that can simultaneously accelerate deep learning-driven password generation and plaintext encryption in an end-to-end manner. The system architecture incorporates a neural processing unit (NPU) and an encryption array configured to operate under a streaming dataflow paradigm for parallel processing. It is the first approach to explore the benefits of end-to-end offline password recovery. For comprehensive evaluation, PassRecover is benchmarked against PassGAN and five industry-standard encryption algorithms (Office2010, Office2013, PDF1.7, Winzip, and RAR5). Experimental results demonstrate excellent performance: compared to the latest work that accelerates only encryption algorithms, PassRecover achieves an average 101.5% speedup across all tested encryption algorithms. Compared to graphics processing unit (GPU)-based end-to-end implementations, this work delivers 93.01% faster processing speeds and 3.73× superior energy efficiency. These results establish PassRecover as a promising solution for resource-constrained password recovery scenarios requiring high throughput and energy efficiency.
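The end-to-end structure, a generation stage streaming candidates into a verification stage, can be sketched in a few lines (hypothetical names; the generator here is a trivial stand-in for a learned model, and hashing stands in for the paper's encryption-format checks):

```python
# Illustrative sketch (hypothetical, not the PassRecover/NPU pipeline):
# stage 1 streams password candidates, stage 2 verifies each against the
# target, mirroring the generation -> encryption-check dataflow.

import hashlib

def candidate_stream():
    """Stand-in for a learned generator: enumerate a few fixed candidates."""
    for word in ["letmein", "hunter2", "secret", "password1"]:
        yield word

def recover(target_digest):
    for cand in candidate_stream():                      # stage 1: generation
        if hashlib.sha256(cand.encode()).hexdigest() == target_digest:
            return cand                                  # stage 2: match found
    return None

target = hashlib.sha256(b"secret").hexdigest()
print(recover(target))   # -> secret
```

Running both stages as a streaming pipeline, rather than generating a full batch before checking, is what lets hardware overlap inference with encryption work.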
APA, Harvard, Vancouver, ISO, and other styles
48

Lu, Anni, Xiaochen Peng, Yandong Luo, Shanshi Huang, and Shimeng Yu. "A Runtime Reconfigurable Design of Compute-in-Memory–Based Hardware Accelerator for Deep Learning Inference." ACM Transactions on Design Automation of Electronic Systems 26, no. 6 (2021): 1–18. http://dx.doi.org/10.1145/3460436.

Full text
Abstract:
Compute-in-memory (CIM) is an attractive solution to address the “memory wall” challenges for the extensive computation in deep learning hardware accelerators. For custom ASIC design, a specific chip instance is restricted to a specific network during runtime. However, the development cycle of the hardware normally lags far behind the emergence of new algorithms. Although some of the reported CIM-based architectures can adapt to different deep neural network (DNN) models, few details about the dataflow or control were disclosed to support such a claim. Instruction set architecture (ISA) could support high flexibility, but its complexity would be an obstacle to efficiency. In this article, a runtime reconfigurable design methodology of CIM-based accelerators is proposed to support a class of convolutional neural networks running on one prefabricated chip instance with ASIC-like efficiency. First, several design aspects are investigated: (1) the reconfigurable weight mapping method; (2) the input side of data transmission, mainly the weight reloading; and (3) the output side of data processing, mainly the reconfigurable accumulation. Then, a system-level performance benchmark is performed for the inference of different DNN models, such as VGG-8 on the CIFAR-10 dataset and AlexNet, GoogLeNet, ResNet-18, and DenseNet-121 on the ImageNet dataset, to measure the trade-offs between runtime reconfigurability, chip area, memory utilization, throughput, and energy efficiency.
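The core of any reconfigurable weight mapping is deciding how a layer's unrolled weight matrix tiles onto fixed-size CIM subarrays; a minimal sketch of that bookkeeping (generic, not the paper's mapper, with assumed 128x128 subarrays) looks like this:

```python
# Illustrative sketch (not the paper's mapper): tile a conv layer's unrolled
# weight matrix onto fixed-size CIM subarrays. Runtime reconfigurability
# amounts to re-tiling the same subarray grid for each layer's dimensions.

import math

def map_layer(kernel_h, kernel_w, in_ch, out_ch,
              subarray_rows=128, subarray_cols=128):
    rows = kernel_h * kernel_w * in_ch   # unrolled input dimension (one row per weight)
    cols = out_ch                        # one column group per output channel
    tiles = (math.ceil(rows / subarray_rows) *
             math.ceil(cols / subarray_cols))
    return rows, cols, tiles

# A VGG-style 3x3 conv, 64 -> 128 channels:
print(map_layer(3, 3, 64, 128))   # -> (576, 128, 5)
```

Partial sums from subarrays along the row dimension must then be accumulated, which is where the reconfigurable accumulation aspect the abstract mentions comes in.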
APA, Harvard, Vancouver, ISO, and other styles
49

Silva, Alexandra. "Position Automata for Kleene Algebra with Tests." Scientific Annals of Computer Science XXII, no. 2 (2012): 367–94. https://doi.org/10.7561/SACS.2012.2.367.

Full text
Abstract:
Kleene algebra with tests (KAT) is an equational system that combines Kleene and Boolean algebras. One can model basic programming constructs and assertions in KAT, which has allowed for its application in compiler optimization, program transformation, and dataflow analysis. To provide semantics for KAT expressions, Kozen first introduced automata on guarded strings, showing that the regular sets of guarded strings play the same role in KAT as regular languages play in Kleene algebra. Recently, Kozen described an elegant algorithm, based on “derivatives”, to construct a deterministic automaton that accepts the guarded strings denoted by a KAT expression. This algorithm generalizes Brzozowski’s algorithm for regular expressions and inherits its inefficiency arising from the explicit computation of derivatives. In the context of classical regular expressions, many efficient algorithms to compile expressions to automata have been proposed. One of those algorithms was devised by Berry and Sethi in the 1980s (we shall refer to it as the Berry-Sethi construction/algorithm, but in the literature it is also referred to as the position or Glushkov automaton algorithm). In this paper, we show how the Berry-Sethi algorithm can be used to compile a KAT expression to an automaton on guarded strings. Moreover, we propose a new automata model for KAT expressions and adapt the construction of Berry and Sethi to this new model.
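The derivative-based matching that the abstract calls inefficient is easy to see in the classical (unguarded) case; a minimal sketch of Brzozowski derivatives for plain regular expressions follows (the paper's setting generalizes this to guarded strings):

```python
# Illustrative sketch of Brzozowski derivatives for plain regular expressions,
# the classical case the paper generalizes. Matching works by repeatedly
# deriving the expression by each symbol, which is the explicit derivative
# computation the abstract refers to.

def nullable(r):
    op = r[0]
    if op in ('eps', 'star'):
        return True
    if op in ('empty', 'sym'):
        return False
    if op == 'alt':
        return nullable(r[1]) or nullable(r[2])
    if op == 'cat':
        return nullable(r[1]) and nullable(r[2])

def deriv(r, a):
    op = r[0]
    if op in ('eps', 'empty'):
        return ('empty',)
    if op == 'sym':
        return ('eps',) if r[1] == a else ('empty',)
    if op == 'alt':
        return ('alt', deriv(r[1], a), deriv(r[2], a))
    if op == 'cat':
        d = ('cat', deriv(r[1], a), r[2])
        return ('alt', d, deriv(r[2], a)) if nullable(r[1]) else d
    if op == 'star':
        return ('cat', deriv(r[1], a), r)

def matches(r, s):
    for a in s:
        r = deriv(r, a)
    return nullable(r)

r = ('star', ('cat', ('sym', 'a'), ('sym', 'b')))   # (ab)*
print(matches(r, "abab"), matches(r, "aba"))         # -> True False
```

Note how the derived expressions grow without simplification; position (Glushkov/Berry-Sethi) constructions avoid this by working over marked symbol occurrences instead.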
APA, Harvard, Vancouver, ISO, and other styles
50

INTERLANDI, MATTEO, and LETIZIA TANCA. "A datalog-based computational model for coordination-free, data-parallel systems." Theory and Practice of Logic Programming 18, no. 5-6 (2018): 874–927. http://dx.doi.org/10.1017/s147106841800042x.

Full text
Abstract:
Cloud computing refers to maximizing efficiency by sharing computational and storage resources, while data-parallel systems exploit the resources available in the cloud to perform parallel transformations over large amounts of data. In the same line, considerable emphasis has been recently given to two apparently disjoint research topics: data-parallel and eventually consistent, distributed systems. Declarative networking has been recently proposed to ease the task of programming in the cloud, by allowing the programmer to express only the desired result and leave the implementation details to the responsibility of the run-time system. In this context, we deem it appropriate to propose a study on a logic-programming-based computational model for eventually consistent, data-parallel systems, the keystone of which is provided by the recent finding that the class of programs that can be computed in an eventually consistent, coordination-free way is that of monotonic programs. This principle is called Consistency and Logical Monotonicity (CALM) and has been proven by Ameloot et al. for distributed, asynchronous settings. We advocate that CALM should be employed as a basic theoretical tool also for data-parallel systems, wherein computation usually proceeds synchronously in rounds and where communication is assumed to be reliable. We deem this problem relevant and interesting, especially for what concerns parallel dataflow optimizations. Nowadays, we are in fact witnessing an increasing concern about understanding which properties distinguish synchronous from asynchronous parallel processing, and when the latter can replace the former. It is general opinion that coordination-freedom can be seen as a major discriminant factor.
In this work, we make the case that the current form of CALM does not hold in general for data-parallel systems, and show how, using novel techniques, the satisfiability of the CALM principle can still be obtained, although just for the subclass of programs called connected monotonic queries. We complete the study with considerations on the relationships between our model and the one employed by Ameloot et al., showing that our techniques subsume the latter when the synchronization constraints imposed on the system are loosened.
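The CALM intuition that monotonic queries need no coordination can be demonstrated with a tiny example (an illustrative sketch, not the paper's formal model): transitive closure is monotonic, so partial results computed independently over any partitioning of the input merge into the same fixpoint as a single global run.

```python
# Illustrative sketch of the CALM intuition (not the paper's formal model):
# a monotonic Datalog-style query (transitive closure) reaches the same
# fixpoint regardless of how the input is partitioned or ordered, so the
# computation is coordination-free.

def transitive_closure(edges):
    """Naive fixpoint: add (a, d) whenever (a, b) and (b, d) are present."""
    tc = set(edges)
    while True:
        new = {(a, d) for (a, b) in tc for (c, d) in tc if b == c} - tc
        if not new:
            return tc
        tc |= new

edges = {(1, 2), (2, 3), (3, 4)}
# Two "nodes" each see only part of the input; merging their partial
# fixpoints and re-running yields the same answer as one global run.
part1 = transitive_closure({(1, 2), (2, 3)})
part2 = transitive_closure({(3, 4)})
merged = transitive_closure(part1 | part2)
print(merged == transitive_closure(edges))   # -> True
```

A non-monotonic query (one involving negation or aggregation) would not enjoy this property, which is exactly where coordination, or the paper's connectivity restriction, becomes necessary.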
APA, Harvard, Vancouver, ISO, and other styles