Friday, May 27, 2011

Mahout Recommendations in Distributed Mode with Hadoop MapReduce


                The implementation of Mahout recommendations in a distributed environment is completely different from the standalone version. In a distributed environment the concepts of DataModel and neighborhood cease to exist, as the data is spread across multiple machines and the computations are not based on local data alone. Once Mahout moves into distributed mode, the process involves a series of mappers and reducers producing multiple intermediate result sets. The entire recommendation process begins with computing the co-occurrence matrix and the user vectors.
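                To make that first step concrete, here is a toy illustration with made-up numbers. Suppose there are three items A, B and C, and the co-occurrence counts (how often two items appear together in some user's history) are: A with B 3 times, A with C once, B with C twice. For a user who has interacted with A alone, the user vector is [1, 0, 0]; multiplying the co-occurrence matrix by this vector gives B a score of 3 and C a score of 1, so B would be recommended first. The mappers and reducers in the job compute these matrix and vector products at scale across the cluster.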
                The Mahout distribution already ships a job that enables recommenders in a distributed environment. Follow the steps below to run Mahout recommendations on Hadoop.

1.       Go to the core directory of the Mahout distribution and run ‘mvn clean package’
(you need Maven installed on your machine)
Once this is done, verify that a job file mahout-core-0.4-SNAPSHOT.job has been created inside the /target directory. This is the MapReduce jar used for computing the recommendations.

2.       Copy the input data set (input.txt) into HDFS
hadoop fs -copyFromLocal input.txt /userdata/input/input.txt
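Each line of input.txt holds one preference in the comma-separated form userID,itemID,value, or just userID,itemID when the data set is Boolean as in this walkthrough. Note that for the Hadoop job both ids must be numeric. A minimal sample (with made-up ids) would look like:

1,101
1,102
2,101
2,103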

3.       Copy the file users.txt into HDFS. users.txt should contain the list of user ids for whom recommendations are required, one per line.
hadoop fs -copyFromLocal users.txt /userdata/input/users.txt
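For instance, to request recommendations only for the two sample users above, users.txt would simply contain:

1
2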

4.       Run the recommender job

hadoop jar target/mahout-core-0.4-SNAPSHOT.job org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
-Dmapred.input.dir=/userdata/input/input.txt \
-Dmapred.output.dir=/userdata/output \
--usersFile /userdata/input/users.txt \
--numRecommendations 5 \
-b true \
-s SIMILARITY_TANIMOTOCOEFFICIENT

All the parameters passed to the job are self-explanatory except a few:
--numRecommendations: the number of recommendations to be generated for each user id specified in the users.txt file
-b true: indicates that the data set is Boolean (no explicit preference values)
-s: indicates which similarity algorithm is to be used for generating the recommendations

5.       You can find the output in the HDFS directory /userdata/output
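The results land in part files (part-r-00000 and so on) inside that directory. Each line pairs a user id with its recommended item ids and scores, along the lines of the following (illustrative values):

1	[103:1.0,104:1.0]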

When we go for distributed computation, the recommendations are calculated offline and stored in an RDBMS or HBase for retrieval by real-time applications. For offline computation it is better to choose item-based algorithms, because on high-traffic sites the list of items grows at a much slower pace than the list of user-item relationships. On an e-commerce site, user-based recommendations computed offline can quickly lose accuracy: with n users buying m items every minute, those new entries are not considered when forming the user neighborhood used for the recommendations made the next moment.

Mahout Recommendations with Data Sets Containing Alphanumeric Item Ids


In real-world data we can't always guarantee that the input supplied for generating recommendations contains only integer values for the user and item ids. If either of these values is not an integer, the default data models that Mahout provides are not suitable for processing the data. Consider the case where the item ids are strings: we need to define a custom data model. In that data model we override one method so that it reads the item id as a string, converts it into a unique long value and returns that long.

Data Model Class

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

public class AlphaItemFileDataModel extends FileDataModel {

      // Maps the generated long ids back to the original string item ids.
      // Created lazily: FileDataModel may parse the file from its constructor,
      // which calls readItemIDFromString() before this subclass's field
      // initializers run, so an eager "new" here would leave the field null
      // during that first parse.
      private ItemMemIDMigrator memIdMigtr;

      public AlphaItemFileDataModel(File dataFile) throws IOException {
            super(dataFile);
      }

      public AlphaItemFileDataModel(File dataFile, boolean transpose) throws IOException {
            super(dataFile, transpose);
      }

      @Override
      protected long readItemIDFromString(String value) {
            if (memIdMigtr == null) {
                  memIdMigtr = new ItemMemIDMigrator();
            }
            // Hash the string id to a long; remember the mapping the first
            // time we meet this id so it can be translated back later
            long retValue = memIdMigtr.toLongID(value);
            if (memIdMigtr.toStringID(retValue) == null) {
                  try {
                        memIdMigtr.singleInit(value);
                  } catch (TasteException e) {
                        e.printStackTrace();
                  }
            }
            return retValue;
      }

      String getItemIDAsString(long itemId) {
            return memIdMigtr.toStringID(itemId);
      }
}

The class below defines the in-memory map that stores the string-to-long id mappings

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.AbstractIDMigrator;

public class ItemMemIDMigrator extends AbstractIDMigrator {

      // In-memory map from the hashed long id back to the original string id
      private final FastByIDMap<String> longToString;

      public ItemMemIDMigrator() {
            this.longToString = new FastByIDMap<String>(100);
      }

      @Override
      public void storeMapping(long longID, String stringID) {
            synchronized (longToString) {
                  longToString.put(longID, stringID);
            }
      }

      @Override
      public String toStringID(long longID) {
            synchronized (longToString) {
                  return longToString.get(longID);
            }
      }

      // Convenience method: hash the string id (toLongID is inherited from
      // AbstractIDMigrator) and store the reverse mapping in one call
      public void singleInit(String stringID) throws TasteException {
            storeMapping(toLongID(stringID), stringID);
      }
}

In your recommender implementation you can use this data model class instead of the default FileDataModel to accept an input that contains alphanumeric item ids, as sketched below. Similarly, you can devise a data model that accommodates alphanumeric user ids as well (by overriding readUserIDFromString in the same way).
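Here is a minimal sketch of how this plugs into a recommender (the file name, the user id and the choice of an item-based recommender below are illustrative assumptions, not fixed requirements):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class AlphaItemRecommenderExample {
      public static void main(String[] args) throws Exception {
            // Load the data set through the custom data model defined above
            AlphaItemFileDataModel model = new AlphaItemFileDataModel(new File("input.txt"));

            // An item-based recommender; Tanimoto suits Boolean preference data
            ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
            Recommender recommender = new GenericItemBasedRecommender(model, similarity);

            // Top 5 recommendations for user 101; translate the internal long
            // item ids back to their original alphanumeric form for display
            List<RecommendedItem> items = recommender.recommend(101, 5);
            for (RecommendedItem item : items) {
                  System.out.println(model.getItemIDAsString(item.getItemID())
                              + " : " + item.getValue());
            }
      }
}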

Evaluating Mahout based Recommender Implementations


           Mahout provides an option to evaluate your generated recommendations against the actual preference values.
        In Mahout's recommender evaluators, part of the real preference data set is held out as test data. These test preferences are absent from the training data set (actual data set minus test data set) that is fed to the recommender under evaluation; that is, all data other than the test data goes to the recommender as input. The recommender then internally estimates preferences for the test data, and these estimated values are compared against the actual values in the data set.
For this Mahout offers two types of evaluators

1.       Average Absolute Difference Evaluator
                The average of the absolute differences between the actual and estimated preferences is calculated. The lower the value, the better the recommendations: a low score means the estimated preferences differed from the actual preferences only to a small extent. A value of 0 indicates that the estimated and actual preferences are identical, i.e. perfect recommendations.

2.       Root Mean Square Evaluator
                Here the score is calculated as the square root of the average of the squares of the differences between the actual and estimated preferences, which penalizes large individual errors more heavily. In this evaluation too, the lower the score the better the recommendations, and 0 again means perfect recommendations.
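In symbols, with p(i) the actual and q(i) the estimated preference over n test preferences (the notation here is mine; Mahout only reports the final score):

Average Absolute Difference = (1/n) * sum over i of |p(i) - q(i)|
Root Mean Square = square root of ( (1/n) * sum over i of (p(i) - q(i))^2 )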
               
The code snippet below shows a sample recommender evaluator implementation.

Recommender Evaluator

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.common.RandomUtils;

public class RecommenderEvaluatorExample {

      private static final int neighbourhoodSize = 7;

      public static void main(String[] args) {
            String recsFile = "D:/inputData.txt";

            /* Create a RecommenderBuilder, overriding the buildRecommender method;
               this builder object is used as one of the parameters for the
               RecommenderEvaluator's evaluate method */
            RecommenderBuilder userSimRecBuilder = new RecommenderBuilder() {
                  @Override
                  public Recommender buildRecommender(DataModel model) throws TasteException {
                        // The similarity algorithm used in your recommender
                        UserSimilarity userSimilarity = new TanimotoCoefficientSimilarity(model);

                        /* The neighborhood algorithm used in your recommender;
                           not required if you are evaluating an item-based recommender */
                        UserNeighborhood neighborhood =
                                    new NearestNUserNeighborhood(neighbourhoodSize, userSimilarity, model);

                        // The recommender used in your real-time implementation
                        return new GenericBooleanPrefUserBasedRecommender(model, neighborhood, userSimilarity);
                  }
            };

            try {
                  // Use this only in unit tests and examples, to guarantee repeatable results
                  RandomUtils.useTestSeed();

                  // The data model passed on to the RecommenderEvaluator's evaluate method
                  DataModel dataModel = new FileDataModel(new File(recsFile));

                  /* The RecommenderEvaluator that performs the evaluation;
                     AverageAbsoluteDifferenceRecommenderEvaluator can be used as well */
                  RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();

                  // Obtain the evaluation score: 70% training data, evaluate on 100% of the data
                  double userSimEvaluationScore =
                              evaluator.evaluate(userSimRecBuilder, null, dataModel, 0.7, 1.0);
                  System.out.println("User Similarity Evaluation score : " + userSimEvaluationScore);

            } catch (IOException e) {
                  e.printStackTrace();
            } catch (TasteException e) {
                  e.printStackTrace();
            }
      }
}


                Let us look in a bit more detail at some of the important lines of code in the above example.

RandomUtils.useTestSeed()
                A lot of randomness is used inside the evaluator to choose the test data. Calling RandomUtils.useTestSeed() ensures that the evaluator chooses the same "random" data every time. Use this line in your evaluations if and only if you are writing unit tests or examples that must produce the same evaluation results every run. Never use it in production code.

evaluator.evaluate(userSimRecBuilder, null, dataModel, 0.7, 1.0)
                The core evaluation operation happens in this method. Let us look at each of its parameters.
The first parameter is the RecommenderBuilder you created above with the overridden buildRecommender() method.
The second parameter, null here, is a placeholder for a DataModelBuilder. Null selects the default, which is fine as long as you are not using a specialized DataModel implementation in your recommender.
The third parameter is the DataModel holding the input data.
The fourth and fifth parameters indicate the volume of input data to be considered. The fourth is the proportion of data used to train the algorithm: 0.7 means 70% of the data is used for training and the remaining 30% for testing. The fifth, 1.0 here, means 100% of the input data is used for evaluation purposes. On a real data set, when the data volume is huge, we would normally take only a small percentage of the input data to evaluate the recommender after each minor code change; in such a case just choose a small portion of the total data set, say 10%, i.e. pass 0.1 as the last parameter. There is a slight loss of accuracy, but for test-driven development it is a good trade-off.
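For instance, a quick-iteration run along the lines described above would look like this (the 0.1 is only an illustrative choice):

// Train on 70% of the sampled data, but sample only 10% of it for evaluation
double quickScore = evaluator.evaluate(userSimRecBuilder, null, dataModel, 0.7, 0.1);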