MapReduce for Distributed Indexing

Summary

MapReduce is a programming model commonly used within the Hadoop framework to access big data stored in the Hadoop File System (HDFS). It consists of several phases: Map, Shuffle, and Reduce.

Google popularised the term MapReduce in 2004, however, the concepts can easily be implemented using any programming language. Below is an implementation of MapReduce using C#.

Mapping Function

Function x = x + 2, and a list of integers 1 -> 3. Map will return:

 
      :              :
    /   \          /  \
  1      :       3     :
        /  \          /  \
      2     :       4     :
           /  \          /  \
         3     []      5    []
static void MapFunction(string[] args)
{
    var index = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    var map = Map<int, int>(x => x + 2, index);

    map.ToList<int>().ForEach(i => Console.Write(i + " "));
    Console.WriteLine();
    Console.ReadKey();
}

static IEnumerable<TResult> Map<T, TResult>(Func<T, TResult> func, IEnumerable<T> list)
{
    foreach (var i in list)
        yield return func(i);
}

Reduce

Function (x, y) => x + y, and a list of elements 1 to 3, Reduce will return an aggregate as follows:

 
      :           
    /   \         
  1      :      f(0,1) = 1   
        /  \          \      
      2     :      f(1,2) = 3
           /  \          \
         3     []      f(3,3) = 6
static void ReduceFunction(string[] args)
{
    var index = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    Console.WriteLine(Reduce<int, int>((x, y) => x + y, index, 0));
    Console.ReadKey();
}

static T Reduce<T, U>(Func<U, T, T> func, IEnumerable<U> list, T x)
{
    foreach (var i in list)
        x = func(i, x);

    return x;
}