Coding a Simple Average Can Be Complex
Choose the Right Average
In most cases you need the arithmetic mean, which you get by summing all the numbers and dividing by the size of the set.
This is not always the right choice. Sometimes you need the harmonic mean, the geometric mean, or some other type of mean, and these can differ significantly from one another. The geometric mean lets you compare two sets of data that cover different ranges, while the logarithmic mean is useful in heat transfer calculations.
Here I will only look at the arithmetic mean. If you want to know about other types of mean, or are suffering from insomnia, consult a textbook.
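To give a feel for how far apart these means can be, here is a minimal sketch using Python's standard statistics module. The three values are made up purely for illustration.

```python
from statistics import mean, geometric_mean, harmonic_mean

values = [1.0, 4.0, 16.0]  # illustrative data, not from any real source

arithmetic = mean(values)           # (1 + 4 + 16) / 3 = 7.0
geometric = geometric_mean(values)  # cube root of 1 * 4 * 16, about 4.0
harmonic = harmonic_mean(values)    # 3 / (1/1 + 1/4 + 1/16), about 2.29

print(arithmetic, geometric, harmonic)
```

The same three numbers give three quite different "averages", which is why the choice matters.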
The Arithmetic Mean
Assuming you or your architect have decided that the arithmetic mean is the correct mean for the project (and that this is an important choice), calculating the mean should be easy. Well, it is if the numbers are all well behaved and available, and neither the values nor the size of the set change during the calculation.
Suppose, however, that we have a sensor and need to know the mean value of its output (or, to make matters a little more complex, a sliding mean). This might be dirty data from a wave power generator, with a sliding mean the best way to clean it up to a minimum usable standard. This is a big data scenario, with lots of data coming in fast.
The first thing you need to know is whether updating the average can be done before new data arrives. If the time taken to update the average is (significantly) less than the minimum interval between the arrival of one datum and the next, life is again fairly simple.
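A sliding mean over the last few readings might be sketched as follows. The class name SlidingMean and the window size are my own assumptions for illustration, not part of any particular library.

```python
from collections import deque


class SlidingMean:
    """Mean of the most recent `window` readings (hypothetical helper)."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.values = deque()
        self.total = 0.0  # running sum of the readings currently in the window

    def add(self, x):
        """Record a new reading and return the mean of the current window."""
        self.values.append(x)
        self.total += x
        if len(self.values) > self.window:
            # Drop the oldest reading so the window never exceeds its size.
            self.total -= self.values.popleft()
        return self.total / len(self.values)


smoother = SlidingMean(window=3)
smoothed = [smoother.add(x) for x in [1.0, 2.0, 3.0, 4.0, 5.0]]
print(smoothed)  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Keeping a running total makes each update cheap, at the cost of letting floating point rounding error build up slowly over a very long run.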
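One way to keep the update time small and constant is to fold each new datum into the existing mean rather than re-summing everything. This is a sketch of that standard incremental update. The class name RunningMean is hypothetical.

```python
class RunningMean:
    """Mean of all readings seen so far, updated in constant time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        """Fold one new reading into the mean and return the new mean."""
        self.count += 1
        # new_mean = old_mean + (x - old_mean) / n
        self.mean += (x - self.mean) / self.count
        return self.mean


running = RunningMean()
for reading in [2.0, 4.0, 6.0, 8.0]:
    running.update(reading)
print(running.mean)  # 5.0
```

Each update is a handful of arithmetic operations regardless of how much data has already arrived, which makes it easier to judge whether it fits inside the interval between readings.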
Except that you may need to know when the average needs updating. One way would be to duplicate the sensor output, using one copy to alert your program that it needs to read the datum from the analogue-to-digital converter that received the other copy. This could be done with memory-mapped I/O. Another option is to use a special value, such as Java’s null, to show that no new data has been received, and to poll the location where the new data is stored. If it is not null, use the value, then set it to null.
If the data arrives at random intervals and the mean interval between arrivals is acceptable, you may be able to use a buffer, say a queue. If the data still arrives too fast, you may need faster hardware, or to delegate computation of partial sums to specialist hardware. In the case of the sensor above, this may mean some form of analogue smoothing.
These considerations are unlikely ever to affect the average (sorry) enterprise developer, who can rely on their library’s MEAN(..) method.
The purpose of this section has been to show that a slightly different use case may require more complex computation and an unexpected degree of thought, rather than to discuss minute details.
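The polling idea can be sketched in Python, with None playing the part of the special "no new data" value. The Mailbox class is a hypothetical stand-in for the shared location, and the lock is my own addition: without it, a datum written between the read and the reset could be lost.

```python
import threading


class Mailbox:
    """One-slot store standing in for the location the sensor writes to."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slot = None  # None means "no new data"

    def put(self, value):
        """Called by the producer; overwrites any datum not yet read."""
        with self._lock:
            self._slot = value

    def take(self):
        """Return the new datum and clear the slot, or None if nothing arrived."""
        with self._lock:
            value, self._slot = self._slot, None
            return value


box = Mailbox()
first_poll = box.take()    # nothing has arrived yet
box.put(3.5)
second_poll = box.take()   # the new datum
third_poll = box.take()    # slot was cleared by the previous poll
print(first_poll, second_poll, third_poll)  # None 3.5 None
```

Note that a single slot silently drops data if two readings arrive between polls.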
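As a rough sketch of the buffering approach, the standard library's queue.Queue can absorb a burst of readings until the program has time to fold them into the mean. The capacity of 1024 and the drain function are assumptions made for this example only.

```python
import queue


def drain(buffer, count, mean):
    """Fold every queued reading into (count, mean) and return the new pair."""
    while True:
        try:
            x = buffer.get_nowait()
        except queue.Empty:
            return count, mean
        count += 1
        mean += (x - mean) / count  # constant-time incremental update


buffer = queue.Queue(maxsize=1024)  # the capacity is an arbitrary assumption
for reading in (10.0, 12.0, 14.0):  # a burst arriving faster than we process
    buffer.put_nowait(reading)      # raises queue.Full if the buffer overflows

count, mean = drain(buffer, 0, 0.0)
print(count, mean)  # 3 12.0
```

A bounded queue makes the failure mode explicit: if the buffer fills up, the data really is arriving too fast for this approach.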
Badly Behaved Numbers
In mathematics, numbers are abstract objects and can be manipulated without problems. In a computer, numbers are represented by a finite string of bits, and adding two numbers of very different magnitude, or two numbers of very similar magnitude and opposite signs, can give an inaccurate result. Integers can overflow and floating point numbers can introduce rounding error. The simplest solution is to use arbitrary-precision arithmetic, which will increase the computation time. Other solutions include sorting the data and adding the sorted values starting with the smallest in magnitude, or sorting the data into bins and keeping a running total of the contents of each bin.
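The following sketch shows the effect on a deliberately contrived data set. It compares a plain left-to-right floating point sum with three alternatives: Python's math.fsum, exact rational arithmetic via fractions.Fraction, and summing in order of increasing magnitude. The naive_sum helper is written out by hand so the result does not depend on how the built-in sum is implemented.

```python
import math
from fractions import Fraction


def naive_sum(values):
    """Plain left-to-right floating point addition."""
    total = 0.0
    for v in values:
        total += v
    return total


data = [1e16, 1.0, -1e16, 1.0]  # contrived values chosen to expose rounding
n = len(data)

naive = naive_sum(data) / n                           # 0.25: the first 1.0 is lost
accurate = math.fsum(data) / n                        # 0.5
exact = sum(Fraction(x) for x in data) / n            # Fraction(1, 2)
by_magnitude = naive_sum(sorted(data, key=abs)) / n   # 0.5 for this data

print(naive, accurate, exact, by_magnitude)
```

Sorting by magnitude happens to recover the right answer here, but it is not a guarantee in general. The exact version is always right and also the slowest.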
Summary
Even a mathematically simple computation can spark complexity in specialist use cases. This is especially true of “Big Data” (high volume, high velocity, and highly variable) and of the embedded space, where space and time may be highly constrained (think car brakes).