
Calculating moving averages with Hadoop and Map-Reduce

Updated on December 28, 2014

Please note that the article by Attain Technologies is a direct copy of this article. This article, the one you are reading, is the original and was published on 29th August 2012. I did not give permission for this and only found out by accident.

A moving average, also called a sliding mean or a rolling average, is a technique for smoothing out short-term variation in data in order to reveal trends. It has uses in signal processing, data analysis and statistics. In engineering terms it is like a finite impulse response filter, while mathematically it is a convolution. For large data sets this smoothing can take a long time, and treating it as a big data problem may allow faster, possibly real-time, smoothing of the data.

The de facto standard for processing big data is Hadoop, though other frameworks may be superior in terms of development time, maintenance time and throughput. Some even claim to allow real-time processing of big data on a standard laptop.

Hadoop is centred on the Map-Reduce paradigm, a paradigm that is bigger than Hadoop and can, to some extent, be implemented with queue-based thread communication. The code presented here is proof-of-concept Java code for Hadoop, developed partly as a learning exercise and partly as a way to break out of the big data cage in which Map-Reduce currently resides. As such, details necessary for production systems, such as the ability to choose the window length, requantise the data, or perform weighted and exponential averages, have not been discussed. The code here will be updated in the light of user comments and future development.

The sliding mean

I first met the sliding mean in the context of analysing telemetry data. The data was contaminated by noise from various sources: vibration, electrical noise and so on. One solution was to smooth the data. Since the timing of events was important, and the data might be subject to systematic biases so that the noise was not centred round a mean value, the smoothed data had to be “shifted” to compensate.

To be concrete, consider a sequence of a thousand noisy data points, each an integer between 0 and 128. A sliding mean with a sample window of 9 points (the reason for 9 rather than 10 is explained below) is calculated by taking the average of points 1 to 9 and placing it in location 1 of the smoothed data, then calculating the average of points 2 to 10 and putting it in location 2, and so on until the smoothed array is filled. If the data are not centred round a mean value, the average lags the latest data point by half the window length.

Sometimes the time shift implied in the last paragraph is not wanted, and the central moving average is computed instead: the smoothed sequence then starts at location 5 and ends at location 996, with the end points normally discarded.
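As a minimal illustration of the two paragraphs above (plain Java, independent of the Hadoop code later in this article; the class and method names are for illustration only), a trailing sliding mean can be computed directly. The centred version is the same computation with each result attributed to the middle of its window rather than the end:

```java
import java.util.Arrays;

public class SlidingMean {
    // Trailing sliding mean: result i is the mean of data[i] .. data[i+window-1].
    // For a centred moving average, result i would be attributed to
    // position i + window/2 of the original sequence instead.
    static double[] smooth(double[] data, int window) {
        double[] out = new double[data.length - window + 1];
        for (int i = 0; i < out.length; i++) {
            double sum = 0;
            for (int j = i; j < i + window; j++) {
                sum += data[j];
            }
            out[i] = sum / window;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        // Window of 9: first value is the mean of points 1..9,
        // second is the mean of points 2..10.
        System.out.println(Arrays.toString(smooth(data, 9)));
    }
}
```

With ten points and a window of 9 only two smoothed values exist, which makes the loss of end points mentioned above concrete.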

A hint of mathematics

The sliding mean is just a convolution [2], or loosely a “smearing”, of the time series with a square pulse, that is, a function that is zero outside the sample window. One way to compute the sliding mean is therefore to take the Fast Fourier Transform (FFT) of the data and multiply it pointwise by the FFT of the square pulse (or, in the case of weighted means, by the FFT of the weighting function). In computational terms the effort of calculating the FFT may outweigh the saving made by replacing division with multiplication.
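In symbols (writing $w$ for the window length), the sliding mean described above is

```latex
y[n] = (x * h)[n] = \sum_{k=0}^{w-1} h[k]\, x[n-k],
\qquad
h[k] = \frac{1}{w} \quad \text{for } 0 \le k < w,
```

and the convolution theorem, $\widehat{x * h} = \hat{x} \cdot \hat{h}$, is exactly the pointwise multiplication of FFTs mentioned above.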

Pursuing the notion of the sliding mean as a convolution leads into areas like fractional calculus [3] and is outwith the scope of this note. The convolution approach does, however, lead to the observation that the square pulse of the simple moving average has a Fourier transform that falls off rapidly at high frequencies, so the smoothing eliminates high frequencies. The larger the sample window, the lower the maximum frequency that is passed through. This is intuitively obvious, but it is nice to have it confirmed by a more rigorous analysis.
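To make the roll-off concrete: a square pulse of width $w$ has the familiar sinc-shaped transform

```latex
\hat{h}(f) = \frac{\sin(\pi f w)}{\pi f w},
```

whose magnitude falls off roughly as $1/f$ and whose first null lies at $f = 1/w$. Doubling the sample window therefore halves the frequency of the first null, which is the quantitative form of the statement that larger windows pass only lower frequencies.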

How the data arrives

There are basically two situations: the first is where the entire sequence is available, for example post-test data from electromechanical systems; the second is where data points arrive one by one, possibly at random times. Here it is assumed the data arrive at fixed times. An intermediate case is where the data are stored in a buffer, with old data being removed as new data arrive. Here the smoothing must be fast enough to ensure all data are processed before new data arrive, even if this means the system is idle waiting for new data.

In order to develop the proof of concept rapidly, test data were stored in a text file, one value to a line, thus simulating the situation where data values arrive one by one. Because the data are not all available at once, the standard technique of calculating a total once, then moving forward by subtracting the oldest point and adding the newest, was not possible. Instead a technique essentially similar to the add() method of a circular list was developed for use in the Mapper class, with the list being converted to a string and sent to the reducer each time a new point is added once the list is full. Again, as the code evolves, smarter methods will be used and the code here updated.
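The circular-list idea can be sketched in isolation before looking at the Hadoop code. This is a minimal standalone illustration, not part of the article's Mapper; the class name is hypothetical:

```java
import java.util.ArrayList;

public class CircularWindow {
    private final int capacity;
    private final ArrayList<Double> window = new ArrayList<Double>();

    public CircularWindow(int capacity) {
        this.capacity = capacity;
    }

    // Add a point, evicting the oldest one once the window is full.
    // Returns true when the window is full and ready to be emitted.
    public boolean add(double value) {
        if (window.size() == capacity) {
            window.remove(0);   // remove the oldest datum
        }
        window.add(value);      // append the newest datum
        return window.size() == capacity;
    }

    // Comma-separated form, as sent to the reducer in the Hadoop code.
    public String asString() {
        String s = window.toString();
        return s.substring(1, s.length() - 1);  // strip Java's square brackets
    }
}
```

Each call to add() after the window fills produces the next window of the sliding mean, which is exactly what the Mapper below emits to the reducer.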

Sample code

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HadoopMovingAverage {
// For production the window length would be a command line or other argument
    static double windowlength = 3.0;
    static int thekey = (int) windowlength / 2;
// used for handling the circular list.
    static boolean initialised = false;
// Sample window
    static ArrayList<Double> window = new ArrayList<Double>();

// The map method processes the data one point at a time and passes the
// circular list to the reducer.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key,
                        Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            double wlen = windowlength;
// creates windows of samples and sends them to the reducer
            partitionData(value, output, wlen);
        }

// Creates a sample window starting at each data point and sends it to the reducer.
        private void partitionData(Text value,
                OutputCollector<Text, Text> output, double wlen)
                throws IOException {
            String line = value.toString();
// the division must be done this way in the mapper, so the reducer need only sum.
            Double ival = new Double(line.trim()) / wlen;
// Build the initial sample window
            if (window.size() < windowlength) {
                window.add(ival);
            }
// emit first window
            if (!initialised && window.size() == windowlength) {
                initialised = true;
                emit(thekey, window, output);
                return;
            }
// Update and emit subsequent windows
            if (initialised) {
// remove oldest datum
                window.remove(0);
// add new datum
                window.add(ival);
                emit(thekey, window, output);
            }
        }
    }

// Transform the list to a string and send it to the reducer. Text could be
// replaced by ObjectWritable.
// Problem: Hadoop apparently requires all output formats to be the same so
// this output collector cannot differ from the one the reducer uses.
    public static void emit(int key,
                            ArrayList<Double> value,
                            OutputCollector<Text, Text> output) throws IOException {
        Text tx = new Text();
        tx.set(new Integer(key).toString());

        String outstring = value.toString();
// remove the square brackets Java puts in
        String tidied = outstring.substring(1, outstring.length() - 1).trim();
        Text out = new Text();
        out.set(tidied);
        output.collect(tx, out);
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key,
                           Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                computeAverage(key, values, output);
            }
        }

// computes the average of each window and sends it to the output collector.
        private void computeAverage(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output) throws IOException {
            double sum = 0;
            String thevalue = values.next().toString();
            String[] thenumbers = thevalue.split(",");
            for (String temp : thenumbers) {
// need to trim the string because the Double constructor does not trim.
                Double ds = new Double(temp.trim());
                sum += ds;
            }
// each point was divided by the window length in the mapper,
// so the sum of a window is already its average.
            Text out = new Text();
            String outstring = Double.toString(sum);
            out.set(outstring);
            output.collect(key, out);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HadoopMovingAverage.class);
        conf.setJobName("movingaverage");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

//      FileInputFormat.setInputPaths(conf, new Path(args[0]));
//      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        FileInputFormat.setInputPaths(conf, new Path("input/movingaverage.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("output/smoothed"));

        JobClient.runJob(conf);
    }
}

The moving average is a commonly used technique for smoothing noisy data. Unlike, for example, Kalman filtering, it is easy to understand and easy to implement. The case where data points arrive one by one is easily transformed into a Map-Reduce program. Working proof-of-concept code has been presented to show that Map-Reduce and Hadoop can be used outside the big data domain.
