# IEEE Floating-Point Standard (IEEE 754 Floating-Point Representation)

Floating-point and fixed-point representations are both widely used in computing, so it is worth understanding how they work. This matters especially on an FPGA, which, unlike an MCU, cannot simply perform multiplication and division directly.

### Fixed point numbers

Let us start with fixed-point numbers, which are the simpler of the two. Fixed-point representation exists to overcome the main limitation of integers: they cannot represent real numbers. It does so by reserving the last few bits for the fractional part. If the fraction happens to be a sum of terms of the form 2^-i, the fixed-point representation is exact, but unfortunately most real numbers are not that convenient. A fixed-point number is therefore only an approximation of a real number, and the same is true of floating point. In practice, the value is written in binary and shifted left by K bits. In an N-bit fixed-point number, the highest bit is the sign, the next N-1-K bits are the integer part, and the low K bits are the fractional part.
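
The shift-by-K encoding described above can be sketched in a few lines of Python. The function names `to_fixed` and `from_fixed` are made up for this illustration, and the sketch relies on Python's signed integers rather than an explicit sign bit, so treat it as a rough model and not as a hardware-accurate one.

```python
def to_fixed(x: float, k: int) -> int:
    """Encode x as an integer with k fractional bits (scale by 2^k, then round)."""
    return round(x * (1 << k))


def from_fixed(q: int, k: int) -> float:
    """Decode an integer with k fractional bits back to a real value."""
    return q / (1 << k)


K = 8  # assumed number of fractional bits for this example

exact = to_fixed(9.625, K)   # 9.625 = 8 + 1 + 2^-1 + 2^-3, so it is exact
approx = to_fixed(0.1, K)    # 0.1 has no finite binary expansion

print(exact, from_fixed(exact, K))
print(approx, from_fixed(approx, K))
```

The second value decodes to something close to, but not equal to, 0.1, which is the approximation the text refers to.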

The weakness of fixed-point representation is its rigidity: because the position of the binary point is fixed, so are the numbers of integer and fractional bits, which makes it poorly suited to expressing very large or very small numbers.
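
To make this trade-off concrete, the following sketch computes the smallest step and the largest magnitude of an N-bit fixed-point format for a few choices of K. The helper `fixed_point_limits` is a hypothetical name, and the layout (1 sign bit, N-1-K integer bits, K fractional bits) is the one assumed in this article.

```python
def fixed_point_limits(n: int, k: int):
    """Return (smallest positive step, largest magnitude) for an n-bit format
    with 1 sign bit, n-1-k integer bits and k fractional bits."""
    step = 2.0 ** -k
    largest = (2 ** (n - 1) - 1) * step
    return step, largest


# Same 16 bits, three different positions of the binary point.
for k in (0, 8, 15):
    step, largest = fixed_point_limits(16, k)
    print(f"K={k:2d}: step={step}, largest={largest}")
```

Moving the binary point trades range for resolution: a single fixed K cannot provide both a large range and a fine step.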

### Floating point number

Floating-point representation is more complex. It expresses a real number in scientific notation, using a mantissa, a base, an exponent, and a sign. For example, the decimal number 123.45 can be written in scientific notation as 1.2345 × 10^2, where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. Because the exponent lets the point float, a much wider range of real numbers can be expressed flexibly.
Tip: the mantissa is also called the significand. "Mantissa" is actually an informal term for the significand.
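
The same decomposition works in base 2. As a rough illustration, the sketch below uses Python's standard `math.frexp` to split a number into a significand and a power of two. The wrapper `binary_scientific` is a name invented for this example, and it assumes a finite, nonzero input.

```python
import math


def binary_scientific(x: float):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= |significand| < 2. Assumes x is finite and nonzero."""
    m, e = math.frexp(x)   # frexp returns 0.5 <= |m| < 1
    return m * 2, e - 1    # rescale to the 1.xxx form


sig, exp = binary_scientific(123.45)
print(f"123.45 = {sig} x 2^{exp}")
```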

In the IEEE standard, a floating-point number occupies a fixed number of consecutive bytes, whose bits are divided into three fields of specific widths: a sign field, an exponent field, and a mantissa field. The values stored in them represent the sign, exponent, and mantissa of the number in binary scientific notation. A given value is expressed by adjusting the mantissa and the exponent, which is why the format is called "floating point". The specific format is shown in the diagram below:

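In the diagram, S is the sign bit and Exponent is the exponent field, the part that makes the point "float". In the 32-bit single-precision format the exponent field is 8 bits wide, so it stores values from 0 to 2^8 - 1, and half of that range, (2^8 - 1)/2 rounded down, gives 127. In the double-precision format the exponent field is 11 bits wide.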
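
One way to look at the three fields on a real machine is to reinterpret the bytes of a single-precision value as a 32-bit integer and mask out each field. This is only a sketch: `float32_fields` is a made-up helper, and the widths used (1, 8, and 23 bits) are the single-precision ones given in this article.

```python
import struct


def float32_fields(x: float):
    """Split x (rounded to single precision) into
    (sign bit, stored exponent, mantissa field) as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa


print(float32_fields(1.0))
print(float32_fields(-2.0))
```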

The exponent field corresponds to the exponent part of the binary scientific notation introduced earlier. It is 8 bits wide for single precision and 11 bits wide for double precision. Taking single precision as an example, an 8-bit field can hold 256 values, from 0 to 255. The exponent, however, can be negative as well as positive. To handle negative exponents, the actual exponent is added to a bias, and the sum is what gets stored in the exponent field. The bias is 127 for single precision and 1023 for double precision. For example, an actual single-precision exponent of 0 is stored as 127, and a stored value of 64 represents an actual exponent of -63. With the bias, the actual exponents of a single-precision number range from -127 to 128 (inclusive). As we will soon see, the actual exponents -127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special values. The usable actual exponents therefore run from -126 to 127. In this article, the minimum and maximum exponents are written Emin and Emax.
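
The bias arithmetic can be checked with a short sketch. The constants and function names below are chosen for this example only, and `stored_exponent_of` reads the exponent field of a value packed as single precision with Python's standard `struct` module.

```python
import struct

BIAS_SINGLE = 127    # bias for single precision
BIAS_DOUBLE = 1023   # bias for double precision


def stored_exponent(actual: int, bias: int = BIAS_SINGLE) -> int:
    """Value written to the exponent field for a given actual exponent."""
    return actual + bias


def actual_exponent(stored: int, bias: int = BIAS_SINGLE) -> int:
    """Actual exponent represented by a stored exponent field."""
    return stored - bias


def stored_exponent_of(x: float) -> int:
    """Read the 8-bit exponent field of x packed as single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF


print(stored_exponent(0))               # actual exponent 0
print(actual_exponent(64))              # stored value 64
print(stored_exponent_of(2.0 ** -63))   # field of a real float with exponent -63
```

Stored values 1 and 254 are the smallest and largest that are not reserved, which is where Emin = -126 and Emax = 127 come from.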

The third field in the diagram is the mantissa field, which is 23 bits long for single precision and 52 bits long for double precision. Apart from some special values that we will discuss later, the IEEE standard requires floating-point numbers to be normalized. This means the digit to the left of the binary point must be 1, so when the mantissa is stored this leading 1 can be omitted, freeing one bit for more mantissa precision. A 23-bit mantissa field therefore expresses a 24-bit mantissa. For example, in single precision the binary number 1001.101 (9.625 in decimal) is written as 1.001101 × 2^3, so the value stored in the mantissa field is 00110100000000000000000: the 1 to the left of the binary point is dropped and the right side is padded with 0s.
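
The 9.625 example can be reproduced with a short sketch that prints the mantissa field and then rebuilds the value by putting the hidden 1 back. Both function names are invented for this illustration, and `decode_normalized` only handles normalized numbers, not the special values.

```python
import struct


def mantissa_field_bits(x: float) -> str:
    """Return the 23-bit mantissa field of x (as single precision) as a bit string."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return format(bits & 0x7FFFFF, "023b")


def decode_normalized(x: float) -> float:
    """Rebuild a normalized single-precision value from its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = -1.0 if bits >> 31 else 1.0
    exponent = ((bits >> 23) & 0xFF) - 127         # remove the bias
    significand = 1 + (bits & 0x7FFFFF) / 2 ** 23  # restore the hidden 1
    return sign * significand * 2.0 ** exponent


print(mantissa_field_bits(9.625))
print(decode_normalized(9.625))
```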


Posted by Dean at May 07, 2014 - 1:08 AM