Dissertations / Theses on the topic 'IEEE floating-point'


Consult the top 17 dissertations / theses for your research on the topic 'IEEE floating-point.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Jain, Sheetal A. 1980. "Low-power single-precision IEEE Floating-point unit." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87426.

2

Kolumban, Gaspar. "Low Cost Floating-Point Extensions to a Fixed-Point SIMD Datapath." Thesis, Linköpings universitet, Datorteknik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-101586.

Abstract:
The ePUMA architecture is a novel master-multi-SIMD DSP platform aimed at low-power computing, for example in embedded or hand-held devices. It is a configurable and scalable platform designed for multimedia and communications. Numbers with both integer and fractional parts are common in computing because many important algorithms, such as signal and image processing, make use of them. A good way of representing these types of numbers is with a floating-point representation. The ePUMA platform currently supports a fixed-point representation, so the goal of this thesis is to implement twelve basic floating-point arithmetic operations and two conversion operations on an already existing datapath, conforming as closely as possible to the IEEE 754-2008 standard for floating-point representation. The implementation should come at a low cost in hardware and power consumption, with a target frequency of 500 MHz. The implementation is compared with dedicated DesignWare components and with floating-point arithmetic done in software on ePUMA. This thesis presents a solution that increases the VPE datapath hardware cost by 15% on average and the power consumption by 15% on average. The highest clock frequency achieved with the solution is 473 MHz. The target clock frequency of 500 MHz is thus not reached, but considering the lack of register retiming in the synthesis step, 500 MHz can most likely be reached with this design.
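The "two conversion operations" mentioned in the abstract sit between the platform's fixed-point formats and IEEE 754 single precision. As a rough illustration of what such conversions compute (this is not ePUMA code; the Q1.15 format and helper names are assumptions made for the example), a software version might look like this:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical Q1.15 fixed-point type: 1 sign bit, 15 fractional bits.
 * Only an illustration of fixed<->float conversion, not the ePUMA
 * datapath described in the thesis. */
typedef int16_t q15_t;

/* Convert Q1.15 to binary32: divide by 2^15. */
static float q15_to_float(q15_t x) {
    return (float)x / 32768.0f;
}

/* Convert binary32 to Q1.15 with saturation and rounding to nearest. */
static q15_t float_to_q15(float x) {
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) return 32767;
    if (scaled <= -32768.0f) return -32768;
    return (q15_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

int main(void) {
    q15_t a = float_to_q15(0.3141f);
    printf("0.3141 -> Q1.15 %d -> float %.6f\n", a, q15_to_float(a));
    return 0;
}
```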
3

Tarnoff, David. "Episode 3.07 – Introduction to Floating Point Binary and IEEE 754 Notation." Digital Commons @ East Tennessee State University, 2020. https://dc.etsu.edu/computer-organization-design-oer/23.

Abstract:
Regardless of the numeric base, scientific notation breaks numbers into three parts: sign, mantissa, and exponent. In this episode, we discuss how the computer stores those three parts to memory, and why IEEE 754 puts them together the way it does.
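As a concrete companion to the episode's description, the snippet below unpacks the three fields of an IEEE 754 single-precision value, sign, exponent and fraction (mantissa), by reinterpreting its bits. The field widths (1/8/23) and the exponent bias of 127 come from the standard; the helper name is just for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Split an IEEE 754 binary32 value into its sign, biased exponent and
 * fraction (mantissa) fields: 1 + 8 + 23 bits, exponent bias 127. */
static void decode_binary32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the bits */
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF; /* biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;     /* 23 explicit mantissa bits */
    printf("%g: sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           f, sign, exponent, (int)exponent - 127, fraction);
}

int main(void) {
    decode_binary32(1.0f);       /* sign 0, exponent 127, fraction 0 */
    decode_binary32(-0.15625f);
    return 0;
}
```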
4

Shafer, Lawrence E. "Data Driven Calculation Histories to Minimize IEEE-754 Floating-point Computational Error." NSUWorks, 2004. http://nsuworks.nova.edu/gscis_etd/830.

Abstract:
The widely implemented and used IEEE-754 floating-point specification defines a method by which floating-point values may be represented in fixed-width storage. This fixed-width storage does not allow every rational value to be stored exactly. While this is an accepted limitation of the IEEE-754 specification, the problem is compounded when non-exact values are used to compute other values. Attempts to manage this problem have been limited to software implementations that require special programming at the source-code level. While this approach works, the programmer must be aware of the software and explicitly write high-level code specifically referencing it. The entirety of a calculation is not available to the special software, so optimum results cannot always be obtained when the range of operand values is large. This dissertation proposes and implements an architecture that uses integer algorithms to minimize precision loss in complex floating-point calculations. This is done using runtime calculation operand values at a simulated hardware level. The calculations are coded in a high-level language such that the coder need not know the details of how the calculation is performed.
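The dissertation's simulated-hardware architecture is not reproduced here, but the accumulation problem it targets is easy to demonstrate in ordinary code: repeatedly adding a value that is not exactly representable drifts away from the true sum, and a compensated (Kahan) summation, one common software-level mitigation of the kind alluded to above, recovers most of the lost precision. The example below is a generic sketch, not the author's method.

```c
#include <stdio.h>

/* Naive binary32 summation of n copies of x. */
static float naive_sum(float x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += x;
    return s;
}

/* Kahan compensated summation: carries the rounding error forward.
 * Compile without -ffast-math so the compensation is not optimized away. */
static float kahan_sum(float x, int n) {
    float s = 0.0f, c = 0.0f;
    for (int i = 0; i < n; i++) {
        float y = x - c;
        float t = s + y;
        c = (t - s) - y;   /* the part of y that was lost in t */
        s = t;
    }
    return s;
}

int main(void) {
    /* 0.1 is not exactly representable in binary floating point. */
    int n = 1000000;
    printf("naive: %.6f  kahan: %.6f  exact: %.1f\n",
           naive_sum(0.1f, n), kahan_sum(0.1f, n), n * 0.1);
    return 0;
}
```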
5

Pathanjali, Nandini. "Pipelined IEEE-754 Double Precision Floating Point Arithmetic Operators on Virtex FPGA’s." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1017085297.

6

Pathanjali, Nandini. "Pipelined IEEE-754 double precision floating point arithmetic operators on virtex FPGA's." Cincinnati, Ohio : University of Cincinnati, 2002. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=ucin1017085297.

7

Liu, Qiong. "Design of an IEEE double precision floating-point adder/subtractor in GaAs technology /." Title page, table of contents and abstract only, 1995. http://web4.library.adelaide.edu.au/theses/09ENS/09ensl793.pdf.

8

Abdel-Hamid, Amr Talaat. "A hierarchical verification of the IEEE-754 table-driven floating-point exponential function using HOL." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp05/MQ64057.pdf.

9

De Blasio, Simone, and Fredrik Ekstedt Karpers. "Comparing the precision in matrix multiplication between Posits and IEEE 754 floating-points : Assessing precision improvement with emerging floating-point formats." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-280036.

Abstract:
IEEE 754 floating-point is the current standard way to represent real values in computers, but alternative formats are emerging. One of these emerging formats is Posit. The main characteristic of Posit is that the format allows for higher precision than IEEE 754 floats of the same bit size for numbers of magnitude close to 1, but lower precision for numbers of much smaller or larger magnitude. This study compared the precision of IEEE 754 floating-point and Posit in matrix multiplication. Different sizes of matrices were compared, combined with different intervals in which the values of the matrix elements were generated. The results showed that Posits outperformed IEEE 754 floating-point numbers in terms of precision when the values lie in an interval equal to or larger than [-0.01, 0.01), or equal to or smaller than [-100, 100). Matrix size did not affect this much, unless the intermediate format Quire was used to eliminate rounding error. For almost all other intervals, IEEE 754 floats performed better than Posits. Although most of our results favored IEEE 754 floats, Posits do have a precision benefit if one can be sure the data is within the ideal interval, so Posits may still have a role to play in the future of floating-point formats.
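Reproducing the posit side of this comparison requires a posit arithmetic library, so the sketch below only illustrates the measurement setup, under the assumption that a binary64 product can serve as the reference: the same matrices are multiplied in binary32 and in binary64, and the normwise relative error of the lower-precision result is reported. Swapping the `float` type for a 32-bit posit type (and the binary64 accumulator for a quire) would recover the thesis's experiment; the matrix size N = 64 is an arbitrary choice for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 64   /* matrix dimension for this illustration */

static float  Af[N][N], Bf[N][N], Cf[N][N];
static double Ad[N][N], Bd[N][N], Cd[N][N];

int main(void) {
    /* Elements drawn from [-1, 1), close to the region where both
     * formats keep most of their precision. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double a = 2.0 * rand() / RAND_MAX - 1.0;
            double b = 2.0 * rand() / RAND_MAX - 1.0;
            Af[i][j] = (float)a; Ad[i][j] = a;
            Bf[i][j] = (float)b; Bd[i][j] = b;
        }

    /* Same triple loop in binary32 and in the binary64 reference. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float  sf = 0.0f;
            double sd = 0.0;
            for (int k = 0; k < N; k++) {
                sf += Af[i][k] * Bf[k][j];
                sd += Ad[i][k] * Bd[k][j];
            }
            Cf[i][j] = sf;
            Cd[i][j] = sd;
        }

    /* Normwise relative error of the binary32 product. */
    double max_err = 0.0, max_ref = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double e = fabs((double)Cf[i][j] - Cd[i][j]);
            double r = fabs(Cd[i][j]);
            if (e > max_err) max_err = e;
            if (r > max_ref) max_ref = r;
        }
    printf("normwise relative error of binary32 matmul: %.3e\n",
           max_err / max_ref);
    return 0;
}
```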
10

Jourdan, Jingyan. "Custom floating-point arithmetic for integer processors : algorithms, implementation, and selection." Phd thesis, Ecole normale supérieure de lyon - ENS LYON, 2012. http://tel.archives-ouvertes.fr/tel-00779764.

Abstract:
Media processing applications typically involve numerical blocks that exhibit regular floating-point computation patterns. For processors whose architecture supports only integer arithmetic, these patterns can be profitably turned into custom operators, coming in addition to the five basic ones (+, -, ×, / and √) but achieving better performance by handling more operations at once. This thesis addresses the design of such custom operators as well as the techniques developed in the compiler to select them in application codes. We have designed optimized implementations for a set of custom operators which includes squaring, scaling, adding two nonnegative terms, fused multiply-add, fused square-add (x*x+z, with z>=0), two-dimensional dot products (DP2), sums of two squares, as well as simultaneous addition/subtraction and sine/cosine. With novel algorithms targeting high instruction-level parallelism and detailed here for squaring, scaling, DP2, and sin/cos, we achieve speedups of up to 4.2x for individual custom operators even when subnormal numbers are fully supported. Furthermore, we introduce the optimizations developed in the ST231 C/C++ compiler for selecting such operators. Most of the selections are achieved at a high level, using syntactic criteria. However, for fused square-add, we also enhance the framework of integer range analysis to support floating-point variables in order to prove the required positivity condition z >= 0. Finally, we provide quantitative evidence of the benefits of supporting this selection of custom operations: on DSP kernels and benchmarks, our approach allows us to be up to 1.59x faster compared to using only the basic operators.
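One reason fused operators such as FMA and DP2 pay off in accuracy as well as speed is that they round once instead of after every intermediate operation. The small experiment below, a generic illustration rather than one of the ST231 operators, contrasts x*x + z evaluated with an intermediate binary32 rounding against the same expression evaluated with `fmaf`, which keeps the product exact until the final rounding; the input values are chosen so that the difference is visible.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* x*x equals 1 + 2^-11 + 2^-24 exactly; rounding the product to
     * binary32 before the addition loses the 2^-24 term, while a fused
     * multiply-add rounds only once, at the end. */
    float x = 1.0f + 0x1p-12f;
    float z = -(1.0f + 0x1p-11f);

    float p        = x * x;          /* product rounded to binary32      */
    float separate = p + z;          /* second rounding after the add    */
    float fused    = fmaf(x, x, z);  /* single rounding of x*x + z       */

    printf("separate: %.10e\n", (double)separate);  /* 0.0               */
    printf("fused:    %.10e\n", (double)fused);     /* ~5.96e-08 = 2^-24 */
    return 0;
}
```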
11

Nilsson, William, and Jakob Arvidsson. "An evaluation of a new standard for floating point precision : A quantitative comparison of Posit and IEEE 754 Float." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-302142.

Abstract:
Posit is a new floating-point representation that has been developed as an alternative to the existing IEEE 754 floating-point standard. Prior studies have found several use cases for Posit, but also potential drawbacks. This thesis examines some of the potential drawbacks concerning the precision of Posit's floating-point representation. The study is conducted through comparisons of 32-bit Posit with 32-bit IEEE 754 for a limited set of operations and algorithms. The results show that Posit is the most accurate at values near one, progressively losing accuracy as the absolute value of the exponent of the number being represented grows larger. Eventually there is a break point where float becomes more precise. The choice of operators or algorithms was not found to have any effect on the precision of the result.
12

Besseling, Johan, and Anders Renström. "A comparative study of IEEE 754 32-bit Float and Posit 32-bit floating point format on precision. : Using numerical methods." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-280101.

Abstract:
Posit is a new way of representing floating-point numbers in computers. This thesis investigates the precision of the 32-bit Posit floating-point format compared to the current standard 32-bit IEEE 754 Float format by conducting tests with numerical methods. Posit was chosen due to its promising results in previous work. The numerical analysis methods that were chosen were the least squares method, the Gauss-Newton interpolation method, the trapezoid method and Newton-Raphson's method. Results from the tests show that Posit32 achieves at least as high precision as IEEE 754 Float on computations over ranges of 1 and above, but tends to gain up to three significant figures of precision when moving towards the range 0 - 1.
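As with the previous comparison, posit arithmetic needs a dedicated library, but the structure of such a test is simple: run the same numerical method in the candidate format and compare it against a higher-precision reference. The sketch below runs Newton-Raphson for sqrt(2) in binary32 against a binary64 reference; substituting a 32-bit posit type for `float` would give the thesis's setup. The iteration count and starting guess are arbitrary choices for the example.

```c
#include <stdio.h>
#include <math.h>

/* Newton-Raphson iteration for sqrt(a): x_{k+1} = (x_k + a/x_k) / 2. */
static float newton_sqrt_f(float a, int iters) {
    float x = a;                 /* crude initial guess */
    for (int i = 0; i < iters; i++)
        x = 0.5f * (x + a / x);
    return x;
}

static double newton_sqrt_d(double a, int iters) {
    double x = a;
    for (int i = 0; i < iters; i++)
        x = 0.5 * (x + a / x);
    return x;
}

int main(void) {
    float  rf = newton_sqrt_f(2.0f, 20);
    double rd = newton_sqrt_d(2.0, 20);
    printf("binary32: %.9f\nbinary64: %.17f\nabs diff: %.3e\n",
           rf, rd, fabs((double)rf - rd));
    return 0;
}
```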
13

Martin-Dorel, Erik. "Contributions à la vérification formelle d'algorithmes arithmétiques." Phd thesis, Ecole normale supérieure de lyon - ENS LYON, 2012. http://tel.archives-ouvertes.fr/tel-00745553.

Abstract:
A floating-point (FP) implementation of a real-valued function is said to be correctly rounded if the computed result is always equal to the rounding of the exact value, which has many advantages. But implementing a function with correct rounding in a reliable and efficient way requires solving the "Table Maker's Dilemma" (TMD). Two sophisticated algorithms (L and SLZ) have been designed to solve this problem, through long and complex computations carried out by heavily optimized implementations. Hence the motivation to provide strong guarantees on the results of these costly pre-computations. To this end, we use the Coq proof assistant. First, we develop a library of "rigorous polynomial approximation", which makes it possible to compute an approximation polynomial and an interval bounding the approximation error inside Coq. This formalization is a key building block for validating the first step of SLZ, as well as the implementation of a mathematical function in general (with or without correct rounding). We then implemented in Coq, formally proved, and made effective three certificate checkers, whose correctness proofs derive from Hensel's lemma, which we formalized in the univariate and bivariate cases. In particular, our "ISValP verifier" is a key component for the formal certification of the results generated by SLZ. Next, we turned to the mathematical proof of "augmented-precision" FP algorithms for the square root and the 2D Euclidean norm. We give sharp lower bounds on the smallest nonzero distance between sqrt(x²+y²) and a midpoint, which makes it possible to solve the TMD for this bivariate function. Finally, when several FP precisions are available, the "double rounding" phenomenon can occur and can change the behavior of small, commonly used arithmetic algorithms. We proved in Coq a set of theorems describing the behavior of Fast2Sum in the presence of double roundings.
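For readers unfamiliar with Fast2Sum, the algorithm mentioned at the end of the abstract is short enough to show directly: when |a| >= |b| and no double rounding interferes, it returns the rounded sum together with the exact rounding error. The sketch below is the textbook binary64 version, not the Coq formalization.

```c
#include <stdio.h>
#include <math.h>

/* Fast2Sum (Dekker): for |a| >= |b|, s = fl(a+b) and t is the exact
 * rounding error, i.e. a + b == s + t holds exactly in binary64
 * (absent overflow, double rounding and value-unsafe optimizations). */
static void fast2sum(double a, double b, double *s, double *t) {
    *s = a + b;
    double z = *s - a;
    *t = b - z;
}

int main(void) {
    double a = 1.0, b = 0x1p-60;   /* b lies far below a's last bit */
    double s, t;
    if (fabs(a) < fabs(b)) { double tmp = a; a = b; b = tmp; }
    fast2sum(a, b, &s, &t);
    printf("s = %.17g\nt = %.17g\n", s, t);  /* s = 1, t = 2^-60 */
    return 0;
}
```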
14

Jacobi, Christian [author]. "Formal verification of a fully IEEE compliant floating point unit / Christian Jacobi." 2004. http://d-nb.info/972323171/34.

15

Liu, Qiong. "Design of an IEEE double precision floating-point adder/subtractor in GaAs technology." Thesis, 1995. http://hdl.handle.net/2440/122410.

Abstract:
This project aims to produce a 64-bit double-precision floating-point adder/subtractor for a Solid Modelling accelerator in Gallium Arsenide technology, which is used to reduce computation time and to increase the accuracy of algorithms. Addition is the most fundamental and the simplest of computer arithmetic operations. Following the IEEE 754 standard format, the resulting architecture, based on the addition/subtraction algorithm, is divided into two main portions: the exponent and the mantissa. In the logic design, three circuits are particularly difficult: the mantissa shifter, the mantissa adder, and the normalizer, all of which affect the speed of the addition. A high-speed design relies not only on the speed of the gates but also on the selection and design of the fastest feasible circuits and on optimal placement to reduce interconnections. A fast barrel shifter has been used as both the alignment shifter and the normalization shifter, and an adder combining a carry-select adder and a binary carry-lookahead adder has been developed for adding the two mantissas. In the normalization, a novel approach has been adopted in the design of the encoder, which omits the 6-bit incrementer normally required in this process. As a result of the idiosyncrasies of GaAs technology, the design is much more difficult than in CMOS. For simplicity of layout, a multi-bit-input circuit has been broken into several segments which are then connected together to achieve the desired function. Furthermore, some examples of PLA implementation are given in the priority detector and the encoder.
Thesis (MESc)--University of Adelaide, Dept. of Electrical and Electronic Engineering, 1996
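The hardware blocks named in the abstract (alignment shifter, mantissa adder, normalizer) can be mirrored in a small software model. The sketch below adds two positive, normal binary32 values with simple truncation instead of IEEE rounding and ignores zero, subnormals, infinities and NaN; it is a teaching model of the datapath stages, not the thesis's GaAs design.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified binary32 addition for two positive *normal* inputs.
 * Rounding is plain truncation; zero, subnormals, Inf and NaN are not
 * handled.  Illustrates the align -> add -> normalize stages only. */
static float add_binary32_model(float fa, float fb) {
    uint32_t a, b;
    memcpy(&a, &fa, sizeof a); memcpy(&b, &fb, sizeof b);

    uint32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    uint32_t ma = (a & 0x7FFFFF) | 0x800000;   /* add implicit leading 1 */
    uint32_t mb = (b & 0x7FFFFF) | 0x800000;

    /* Alignment shifter: shift the mantissa of the smaller exponent. */
    if (ea < eb) { uint32_t t; t = ea; ea = eb; eb = t;
                   t = ma; ma = mb; mb = t; }
    uint32_t d = ea - eb;
    mb = (d > 24) ? 0 : (mb >> d);

    /* Mantissa adder. */
    uint32_t m = ma + mb;

    /* Normalizer: a carry out into bit 24 means shift right, bump exponent. */
    uint32_t e = ea;
    if (m & 0x1000000) { m >>= 1; e++; }

    uint32_t r = (e << 23) | (m & 0x7FFFFF);   /* sign bit stays 0 */
    float out; memcpy(&out, &r, sizeof out);
    return out;
}

int main(void) {
    printf("model: %.7g   hardware: %.7g\n",
           add_binary32_model(1.5f, 2.25f), 1.5f + 2.25f);
    return 0;
}
```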
16

Shih, Wun-Cai (施文財). "Digital Signal Processing Scheme for Wearable Devices-Using Mixed Fixed-Point and IEEE-754 Floating-Point Digital Filter Implementations." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/40791590197410585385.

Abstract:
Master's thesis, Asia University, In-service Master's Program, Department of Photonics and Communication Engineering, academic year 104.
This thesis presents a hybrid architecture of floating-point and fixed-point computations using a normal-form transformation for finite-precision infinite impulse response (IIR) digital filter state-space implementations. The proposed method provides a good compromise between operational speed and output performance. To obtain higher operational speed, the state equation of the digital filter is implemented in fixed-point format. In contrast, to reduce the distortion caused by the underflow effect, floating-point format is used in the output equation. The developed digital filter implementation method is suitable for use in narrow-band implementations, such as a narrow stop-band IIR digital filter for eliminating power-line interference, or an extremely low pass-band filter to reject baseline wander in a wearable mini-ECG. We have found that the underflow effect is associated with the bandwidth and the sampling frequency in digital filter implementations. Some numerical examples illustrate the effectiveness of the proposed approach.
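The underflow effect the abstract refers to already shows up in a one-pole IIR filter: when the pole sits very close to the unit circle (a very narrow band), the small input term of the fixed-point state update falls below the quantization step and is lost. The sketch below contrasts a Q15 state update with a double-precision reference; the coefficient, input level and iteration count are arbitrary example values, and this is an illustration of the effect rather than the thesis's normal-form hybrid architecture.

```c
#include <stdint.h>
#include <stdio.h>

/* One-pole low-pass IIR:  y[n] = (1 - a) * x[n] + a * y[n-1],
 * with a very close to 1 (narrow bandwidth).  The Q15 version loses the
 * tiny (1 - a) * x[n] term to quantization/underflow. */
#define A_Q15   32700                 /* a ~ 0.99792 in Q15            */
#define ONE_Q15 32768

int main(void) {
    double  a     = (double)A_Q15 / ONE_Q15;
    double  y_ref = 0.0;
    int32_t y_q15 = 0;                /* filter state kept in Q15      */
    int16_t x_q15 = 30;               /* small constant input, ~0.0009 */
    double  x     = (double)x_q15 / ONE_Q15;

    for (int n = 0; n < 5000; n++) {
        y_ref = (1.0 - a) * x + a * y_ref;
        /* Fixed-point update: both products truncated back to Q15, so
         * the input term (68 * 30) >> 15 underflows to zero. */
        y_q15 = (int32_t)(((ONE_Q15 - A_Q15) * (int32_t)x_q15) >> 15)
              + (int32_t)(((int64_t)A_Q15 * y_q15) >> 15);
    }
    printf("double state: %.6f   Q15 state: %.6f\n",
           y_ref, (double)y_q15 / ONE_Q15);
    return 0;
}
```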
17

Seidel, Peter-Michael [author]. "On the design of IEEE compliant floating-point units and their quantitative analysis / vorgelegt von Peter-Michael Seidel." 2007. http://d-nb.info/984957855/34.
