Goto Computer Systems Notes main page
Inherent, truncation and round off errors
Many computer users, who are not aware of how numbers are represented and processed within computer systems, tend to assume that the results of programs are accurate. For example, if a decimal number is printed as 123.45678 all the figures are assumed to be an accurate representation of the result of whatever calculation was performed.
When computer programs
are being implemented and tested the errors which can occur include:
The error in the
evaluation of an algorithm can be defined to be the true value minus the
approximate value as calculated. Usually in practice only the approximate
value (as calculated by the program) is known. In general, however, it
is possible to determine something about the error without knowing it precisely.
For example, using the technique of Interval Halving to find the root of:
4 - 9x3 - 2x2
+ 120x - 130
The iterations could stop at the point when a root was between -3.60016 and 3.60013 and it is not necessary to halve and try again. In this case the maximum error would be 0.00003 and the root was -3.600145 + 0.000015. Thus, although the value of the error is not known it can be stated that it does not exceed 1.5 * 10-5. This error is called an absolute error and one common notation is to write a bar over the symbol to indicate an approximation and to write an e with subscript to stand for the error. Thus, if x is the true value:
x = x + ex
Thus the absolute
error is the difference between the true value and the approximation.
Relative error
Another way to define the error is using the relative error form which is defined as the absolute error divided by the approximation. It may seem more reasonable to divide the absolute error by the true value but usually this is not known. So long as the error is small the above definition using approximate value should have no sizable bearing on the numerical value of the relative error.
Comparison of using absolute and relative error
For numbers close to 1 the absolute and relative errors are nearly equal but for large and small numbers the error values can be very different. For example, if we have a true value of 0.00006 and an approximation of 0.00005, the absolute error is only 10-5but the relative error is 0.2 or 20%. On the other hand, if the true value is 100,500 and approximation is 100,100 the absolute error is 500 but the relative error is only 0.005, or 0.5%. Thus in general relative errors are the more useful way of specifying errors.
One extension of
the above when writing general purpose software, for example subroutines
to find roots, is to use relative error tests to determine if convergence
has occurred to within a certain error value. The same routines can then
be used for small and large magnitude numbers. For example, to find if
convergence has occurred between successive iterations Xn and Xn+1
the relative test
ABS((Xn - Xn+1) / Xn) < error will work
well in general. The value 'error' would then be the relative error of
the result (assuming that successive iterations converge smoothly).
Real numbers
Real number types are required in many mathematical, scientific and engineering applications to hold numeric values that may not be represented in integer (whole number) form. For example, numeric values tha may:
Within the computer hardware real numbers are implemented in floating point format which has two important attributes:
The accuracy to which real number calculations can be carried out using a particular real type is determined by the precision, i.e. the maximum number of decimal places that the floating value carries. In practice, the number of bits used to represent real types is implementation dependent so long as the minimum precision is maintained and each type must give at least the same range and precision as the previous type. It is therefore possible for the real types to be three distinct sizes or all the same length (float is typically 32-bit, double 64-bit and many compilers implement long double as double).
The standard header
file <float.h> provides details on the range and precision of real numbers
enabling programs to determine if the specific implementation can support
the application. Table 1 shows the range and precision of the real types
under Turbo C++ V3.00 running on an IBM PC compatible.
| float | double | long double | |
| size in bytes | 4 | 8 | 10 |
| range minimum | 1.175494e-38 | 2.2250738585e-308 | 3.3621031431e-4932 |
| range maximum | 3.402823e+38 | 1.7976931348e+308 | 1.1897314953e+4932 |
| precision digits | 6 | 15 | 19 |
| precision epsilon | 1.192092e-07 | 2.2204460493e-16 | 1.0842021725e-19 |
Table 1 Range and precision of the real types under Turbo C++ V3.00
In addition to the number of digits of precision for each real type <float.h> contains a value epsilon which is an indicator of the precision to which calculations can be carried out (see Table 1). Epsilon is specified as (!= means not equal):
1.0 + epsilon != 1.0
The value of epsilon
is the smallest number which when added to 1.0
changes the value, i.e. adding any smaller value to 1.0
still gives 1.0.
The higher the precision of the real type the smaller epsilon and programs
can check this value to see if calculations can be carried out to the required
precision, i.e. using floating point numbers with a precision of 4
digits epsilon would be 0.001.
For example, consider calculating the roots of the equation:
x2 + 0.4002x + 0.00008 = 0
Using the formula
x = - b + ((b2 - 4ac))1/2
2a
The true root is - 0.0002. However, when using four-digit arithmetic the root would be - 0.0001625 (an error of 25%), i.e.:
using 8 digit arithmetic
x = - 0.4002 + ((0.16016004 - 0.00032))1/2
= - 0.0002
2.0
using 4 digit arithmetic
x = - 0.4002 + ((0.1602 - 0.0003))1/2
= - 0.0001625
2.0
The four-digit arithmetic assumed that numbers were rounded to the nearest digit.
The type chosen in practice is application dependent, i.e. on the precision and range of the data to be processed. For example, there is no point in sampling experimental data accurate to 3 decimal digits (e.g. measurements made using a good analogue meter) and then processing it using double variables and printing the results to 10 or 12 digits. In fact results so presented Consider the Taylor series for the sine function:
sin x = x - x3 + x5
- x7 + .......
3! 5! 7!
This series is, in
theory, valid for any finite angle. The truncation error committed by stopping
the summation after a finite number of terms is less than the absolute
value of the first term neglected. This is true if there were some way
to keep an infinite number of digits when calculating each arithmetic result.
For example, using six-digit arithmetic to calculate the sine of 1470oand
terminating the summation when a term is less than 10-8
gives
sine (1470o)
= 891.799. Even if sixteen-digit
arithmetic is used sine(2550o)
= 158. Thus serious errors are
occurring due to limitations in the accuracy of computing the terms in
the series.
Below is shown a summary of results of a C program which calculates and prints the sine for various angles (using GNU C++ V2.6.3 under DOS). It can be seen that when the angle is below 5.0 radians the value calculated is the same as that returned by the maths function sin(x) to approximately six figures. When the angle is 10 radians the accuracy of the sum is approximately 2 to 3 figures and when the angle is above 20 radians the results are complete nonsense ! The series is valid for all angles, assuming that the calculations are performed to infinite accuracy, and the problem with the results is due to rounding errors in floating point calculations (which will be discussed in the next section).
sin( 0.50) =
0.47942552, library =
0.47942554
sin( 1.00) =
0.84147096, library =
0.84147098
sin( 2.00) =
0.90929741, library =
0.90929743
sin( 5.00) =
- 0.95892441, library =
- 0.95892427
sin(10.00) =
- 0.54420567, library =
- 0.54402111
sin(15.00) =
0.66512936, library =
0.65028784
sin(20.00) =
1.52592909, library =
0.91294525
sin(25.00) =
- 5.36588192, library =
- 0.13235175
sin(30.00) = - 24048.72656250,
library = - 0.98803162
sin(35.00) =- 2340664.25000000,
library = - 0.42818267
Inherent Errors
Inherent errors are errors in the values of data. For example, caused by uncertainty in measurements, by outright blunders, or by the necessary approximate nature of representing some number by a finite number of digits when that number cannot be represented exactly by that number of digits, e.g.SQRT(2.0).
When physical data is being processed errors of measurement are bound to occur. When measuring a voltage with an analogue meter (meter with a dial and indicator pointer) it is difficult to take readings with an accuracy greater than 2 or 3 significant figures. This can be improved by using a digital voltmeter (reading presented on a digital readout) where an accuracy of 4 to 8 significant can be achieved depending upon the meter. Thus physical values are usually presented + some error.
Mistakes in taking measurements are common, varying from a simple misreading of a meter scale to copying data down incorrectly on to paper. Computer systems that can sample information directly using analogue to digital converters can be a great aid in this area in addition to being able to sample far more data in a shorter time than a human operator.
Many numbers cannot be represented exactly in a given number of decimal digits. For example, n can be written 3.14, 3.14159265 or 3.141592653589793. In any case we have no exact value for which is a transcendental number and therefore has no exact decimal representation. When using such numbers in computation one would give the number to the maximum number of significant figures that the arithmetic works to. For example, if the computer uses 8 figure arithmetic it is no use specifying it to 20 figures, the computer would not store the rightmost digits. An additional problem is that computers tend to work with the binary number system and numbers that can be represented in decimal exactly have no exact equivalent in binary. One example is the number 0.1 decimal whose binary equivalent 0.0001100110011001100 ...., which is a non terminating binary number. Thus, forming the sum of 10 numbers, each of which uses the binary equivalent of 0.1 , will not give the result 1.0 but some approximation.
Note that transcendental and irrational numbers do not have a terminating representation in any number base. Rational numbers, however, may or may not have a terminating representation depending upon the number base.
Truncation Errors
Many mathematical processes are infinite (e.g. the sine series), which in practice have to be terminated at some point. The terms omitted (which are infinite in number) introduce an error into the result. This error is called the truncation error, since it is caused by the truncation of an infinite process. Many of the algorithms used in processing data and calculating results use infinite series or processes so this error is of great importance.
The decision to be taken is at what point to terminate the process. This can be a simple examination of the relative change in the result between two terms of the process or even when a term falls below a particular number, e.g. in computing sine we could ignore all terms after 10-8. This is usually fine if the series is well behaved and converges smoothly but if the terms can vary in size making the convergence not so smooth some other test must be applied. For example, the series could be truncated if the change in the result was below a certain value over ten terms (or even more).
Rounding errors
The above discussion showed how four and eight digit real number arithmetic affected the determination of the roots of a quadratic equation. In programming languages such as Fortran, C/C++ real numbers are stored in floating point form, i.e. a fractional part and an exponent. Decimal floating point numbers are in the form f*10e where f is the fraction in the range 0.1 < f < 1.0 and e is the exponent, e.g. 7392.0 would be 0.7392 * 104 and 0.0007392 would be 0.7392 * 10-3. Note, however, that computer systems use base 2 or 16 for floating point numbers, e.g. f * 2e, but the following discussion still applies.
When computation
is being carried out the fractional part of a decimal floating point must
be within the range
0.1 <
f < 1 and if at any time the
leading digit becomes 0 then
the fractional part is shifted one place left and the exponent decremented,
e.g. 0.0023 *104
becomes 0.23 * 102.
This process of getting the fractional part in the correct range is called
normalisation.
Now suppose that
two floating point numbers which are accurate to four significant figures
are to be added:
z = 1.246 + 0.03290 = 0.1246*101 + 0.3290*10-1
Before the add can
take place the floating point number exponents must be aligned, i.e. one
of the numbers unnormalized:
z = 0.1246*101 + 0.00329*101
This gives the result
0.1278 * 101
assuming four figure accuracy. Note that the digits of the fraction shifted
out of the capacity of the system have been lost. Furthermore, no rounding
was carried out when the digits were lost. Many computer systems do not
round in this situation so it is a likely source of error in the program.
Now consider subtraction:
z = 26.31 - 19.76 = 0.2631*102 - 0.1976*102 = 0.0655*102
this has resulted
in a leading zero so it will be shifted:
z = 0.6550*101
Notice two things:
Now consider the evaluation of the sine series. As the signs of the terms alternate rounding errors can result from the subtraction process (as described above). In addition, if x is large the value of xn can become a very large number. Below is a run of a program evaluating the sine of an angle of 25 radians. The final value of the sine of an angle must be in the range -1.0 to 1.0 but the run shows values as large as 5726042112.0. As the float calculations are only carried out to 6 or 7 figures of accuracy the last four figures of the number are meaningless. We are therefore attempting to get a result in the range -1.0 to 1.0 using values that are not accurate to hundreds or even thousands.
n term_n sin_x
5.000000000
-2604.166748047
-2579.166748047
7.000000000
81380.210937500
78801.046875000
9.000000000
-1211015.000000000
-1132214.000000000
11.000000000
10512283.000000000
9380069.000000000
13.000000000
-59728880.000000000
-50348812.000000000
15.000000000
239298400.000000000
188949584.000000000
17.000000000
-712197632.000000000
-523248064.000000000
19.000000000
1636483584.000000000
1113235456.000000000
21.000000000
-2990649856.000000000
-1877414400.000000000
23.000000000
4450371584.000000000
2572957184.000000000
25.000000000
-5497000448.000000000
-2924043264.000000000
27.000000000
5726042112.000000000
2801998848.000000000
29.000000000
-5097971712.000000000
-2295972864.000000000
31.000000000
3923931392.000000000
1627958528.000000000
33.000000000
-2637050624.000000000
-1009092096.000000000
35.000000000
1560754432.000000000
551662336.000000000
37.000000000
-819723968.000000000
-268061632.000000000
39.000000000
384630240.000000000
116568608.000000000
41.000000000
-162209104.000000000
-45640496.000000000
43.000000000
61817492.000000000
16176996.000000000
45.000000000
-21393096.000000000
-5216100.000000000
47.000000000
6752871.000000000
1536771.000000000
49.000000000
-1952148.125000000
-415377.125000000
51.000000000
518746.843750000
103369.718750000
53.000000000
-127143.835937500
-23774.117187500
55.000000000
28833.417968750
5059.300781250
57.000000000
-6067.638671875
-1008.337890625
59.000000000
1188.055786133
179.717895508
61.000000000
-216.988571167
-37.270675659
63.000000000
37.054058075
-0.216617584
65.000000000
-5.929028988
-6.145646572
67.000000000
0.890779614
-5.254867077
69.000000000
-0.125901684
-5.380768776
71.000000000
0.016770791
-5.363997936
73.000000000
-0.002109003
-5.366106987
75.000000000
0.000250785
-5.365856171
77.000000000
-0.000028242
-5.365884304
79.000000000
0.000003016
-5.365881443
81.000000000
-0.000000306
-5.365881920
83.000000000
0.000000030
-5.365881920
85.000000000
-0.000000003
-5.365881920
sin(25.00) = -5.36588192, library = -0.13235175
Using higher precision types to overcome rounding errors
One way to reduce the effect of rounding errors is to increase the precision and range of the variables and constants used. This, however, has an overhead in that the extra computation involved increases the run-time of the program. In many applications there is often a trade-off between the accuracy desired and the run-time acceptable.
Increasing precision,
however, only reduces the problem, it does not eliminate it. Consider the
following results produced using float, double and long double versions
of the sin(x) function (under GNU C++). It may be seen that even long double
(19 decimal digits of precision) is only effective up to about 30 radians.
| angle | float | double | long double | library |
| 1
2.00 5.00 10.00 20.00 30.00 40.00 50.00 60.00 |
0.841471
0.909297 -0.958924 -0.544206 1.525929 -24048.726562 523443136.000000 -8887603298304.000001 -22202444742131712.000689 |
0.841471
0.909297 -0.958924 -0.544021 0.912945 -0.988053 -0.605842 9991.828290 27872935.154430 |
0.841471
0.909297 -0.958924 -0.544021 0.912945 -0.988032 0.744634 13.111824 147617.705161 |
0.841471
0.909297 -0.958924 -0.544021 0.912945 -0.988032 0.745113 -0.262375 -0.304811 |
Modification of the algorithm to overcome rounding errors
A change to the algorithm
may be possible, e.g.:
x = x - two_pi * int(x / two_pi); // get x in range 0 to 2pi
The expression int(x / two_pi) evaluates, as an integer, how many times 2n goes into the angle x. This integer value is then multiplied by 2n and subtracted from the angle x the result being an angle in the range 0 to 2n. The problem is that casting the floating result of x / two_pi to an int may result in integer overflow. This can be overcome by using the maths library function floor:
double floor(double x); // returns the largest integer below x
which returns as a double function result the largest integer below x (the effect is similar to an int cast but avoids the danger of overflow). Hence the statement can be rewritten (and it also works for negative angles):
x = x - two_pi * floor(x / two_pi); // get x in range 0 to 2pi
The table below shows results produced using float, double and long double versions of the updated sine function. The double and long double results are as accurate as the maths library function to the six figures displayed. The error in the float is due to rounding errors in the evaluation of the angle in the range 0 to 2n.
The above discussion
has necessarily been very brief; refer to a text on numerical methods for
a full discussion of such problems and the techniques used to overcome
them (Press et al. 1994).
| angle | float | double | long double | library |
|
1
10.00 100.00 1000.00 10000.00 100000.00 1000000.00 10000000.00 100000000.00 1000000000.00 10000000000.00 |
0.841471
-0.544021 -0.506368 0.826864 -0.305349 0.038530 -0.375923 0.653595 -0.744683 -0.854296 -0.729213 |
0.841471
-0.544021 -0.506368 0.826864 -0.305615 0.035749 -0.349992 0.420548 0.931639 0.545843 -0.487507 |
0.841471
-0.544021 -0.506368 0.826864 -0.305615 0.035749 -0.349992 0.420548 0.931639 0.545843 -0.487505 |
0.841471
-0.544021 -0.506368 0.826864 -0.305614 0.035749 -0.349994 0.420548 0.931639 0.545843 -0.487506 |
1
If the result of a calculation is less than 0.1*10-99
the number is lost and usually treated as zero. This can
lead to fantastic results if this result is used for further calculations.
This is underflow.
2
If the result of an operation produces an exponent of a 100
or
greater this is larger than the system can
store and floating overflow has occurred. This is generally a fatal run
time error and the program stops.
Use of guard digits
In the arithmetic/logic unit in some systems (e.g. the IBM 360/370 series) there is a guard digit position. This may be thought of as standing to the immediate right of the fraction of the number with the smaller exponent, into which digits are moved during the shifting process required to align the exponents, i.e. during unnormalization. Then if the sum turns out not to be normalised, that one digit can be brought back during normalisation.
For example:
z = 0.1064*10 - 0.3173*102 = 0.1064*103 - 0.0317(3)*103
note the extra digit
position. What happens is that 0.03173
is subtracted from 0.10640
to give 0.07467. This
can now be normalised to give four figures:0.7467*102.
Without the guard digit the result would have been
using four figure
arithmetic 0.7470*102.
This use of a guard digit can give greater accuracy in computation when such unnormalization/normalisation processing. In addition the guard digit can be used to make provision for rounding but this is not usually done.
Other number bases
In general computer
arithmetic is carried out in binary:
x = f * 2e
where 1 / 2 < | f | < 1
or hexadecimal:
x = f * 16e
where 1 / 16 < | f | < 1
but the arguments
applied above still apply.
Round off Errors Continued
Consider the addition
using four digit arithmetic of:
z = 0.1246101 + 0.3290*10-1 =
0.12789*101
but seeing how we
work to four digit accuracy this will be chopped to 0.1278*101
(no rounding) giving an error of 0.00009*101.
We can write this as:
Z = 0.1278*101 + 0.9*10-3
notice that the exponent of the second term is four less than the exponent of the first term using our four figure arithmetic.
In general we can
consider that arithmetic operations will produce a result that can be broken
into halves which we can denote (before rounding):
y = fy * 10e + gy
* 10e-t
where t is the number of digits in the mantissa of the floating point number. Now the range of fy as it is normalised is:
1 / 10 < | f | < 1
But, gy being the error part, may not be normalised and could be zero:
0 < | gy | < 1
The question is 'what can we do to fy, our result, taking into account the value of gy'. Many computer systems just ignore it and this process is usually called chopping. In fact if no guard digits are kept this is the only option. However, if guard digits are kept the value of gy could be used to modify fy in some way to reduce the error.
A bound on the maximum
relative error in a chopped arithmetic result is easily found. The maximum
error occurs when gy is large and fy is small, i.e. the maximum value of
gy is less than
1.0
and the minimum size of fy is
0.1. The absolute value of the
relative error is therefore:
| ey | | gy
* 10e-t | 1.0 * 10e-t
|----| = |-------------| < -------------
= 10-t+1
| y | | fy
*
1-e | 0.1*10e
Remembering that t is the number of digits in the mantissa or fractional part of any floating point number we have the result that the maximum relative rounding error in the result of floating point arithmetic operations does not depend upon the size of the numbers. For example using four figure arithmetic the maximum relative rounding error is 10-3.
Now consider the more usual form of rounding called symmetric rounding. This can be expressed as follows: given the two parts of a result, as above, let the rounded approximation to y to be given by:
| fy| * 10e
if | gy
|
< 1/2
| y | =
| fy
| 10e + 10e-t if | gy |
> 1/2
where y- has the same sign as fy. The addition of the term 10e-t in the second line corresponds to adding 1 into the last digit position retained if the first figure dropped is 5 or greater.
Thus if gy
< 1 the absolute
error is:
ey
= gy * 10e-t
if gy
>
1/2 it is:
ey
= 1 - gy * 10e-t
In either case we have 10e-t multiplied by a factor whose absolute value is no greater than 1. Thus the absolute value of the absolute error is therefore:
ey *
10 e-t
and the absolute
value of the relative error is therefore:
| ey | | 1 * 10
|e-t
| --- | = | ----------- | < 5 * 10-t
= 1 * 10-t+1
| _ | |
|
| y | | 0.1 * 10 -e |
For example, of the
difference between the two rounding rules consider:
y =0.7324 * 103 + 0.8261 * 10-1
using chopping
_
y = 0.7324 * 103
and
| ey | | 0.8261 * 10-1|
| -- | = | ------------ | ~ 1.1 * 10-4
| _ | |
|
| y | | 0.7324 * 103 |
whereas for symmetric
rounding
_
y = 0.7325*103 and
ey = -0.1739 * 10-1 giving:
| ey |
| -- | = 0.24 * 10-4
| _ |
| y |
Thus one can see that the error from rounding is considerably less than that from chopping. In fact the rounding error never exceeds the chopping error and is less about half the time.