14 November 2011

Using MapReduce to refactor entities on GAE

I've been following Ikai Lan's excellent tutorial on MapReduce for App Engine (Java).

As a simple exercise, I've managed to successfully refactor some entities on the production server.
The goal is to take the existing rows of an entity called Event and rename a field from program to project.
On the App Engine dashboard I can query for the rows still carrying the old field with this GQL string:
"SELECT * FROM Event WHERE program != null".
At the end of this procedure, the GQL query shouldn't return any rows.

Here are four simple steps.

1. Edit mapreduce.xml where I specify my mapper class and pass the Entity kind 'Event' as a parameter

  <configuration name="Program to Project">
    <property>
      <name>mapreduce.map.class</name>
      <value>com.sirtrack.iridium.mapper.ProgramToProject</value>
    </property>
    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Event</value>
    </property>
  </configuration>

2. Implement the mapper class

 package com.sirtrack.iridium.mapper;

 import com.google.appengine.api.datastore.DatastoreService;
 import com.google.appengine.api.datastore.DatastoreServiceFactory;
 import com.google.appengine.api.datastore.Entity;
 import com.google.appengine.api.datastore.Key;
 import com.google.appengine.tools.mapreduce.AppEngineMapper;
 import org.apache.hadoop.io.NullWritable;
 import java.util.logging.Logger;

 public class ProgramToProject extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {

   private static final Logger log = Logger.getLogger( ProgramToProject.class.getName() );
   private DatastoreService datastore;

   public ProgramToProject() {
   }

   @Override
   public void taskSetup( Context context ) {
     this.datastore = DatastoreServiceFactory.getDatastoreService();
   }

   @Override
   public void map( Key key, Entity value, Context context ) {
     log.warning( "Mapping key: " + key );
     if( value.hasProperty( "program" ) ) {
       // Copy the value into the new field, then null out the old one
       Object program = value.getProperty( "program" );
       value.setProperty( "project", program );
       value.setProperty( "program", null );
       datastore.put( value );
     }
   }
 }
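To make the rename logic easy to try outside App Engine, here's a minimal, self-contained sketch that applies the same copy-then-null pattern to a plain HashMap standing in for a datastore Entity. The class name RenameSketch and the sample value are illustrative, not from the real app:

```java
import java.util.HashMap;
import java.util.Map;

public class RenameSketch {

  // Mirrors the mapper's logic on a plain map: copy "program" into
  // "project", then set the old field to null (as the mapper does,
  // rather than removing the property entirely).
  static void rename(Map<String, Object> entity) {
    if (entity.containsKey("program")) {
      entity.put("project", entity.get("program"));
      entity.put("program", null);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> event = new HashMap<>();
    event.put("program", "Kiwi Tracking"); // illustrative sample value
    rename(event);
    System.out.println(event.get("project")); // Kiwi Tracking
    System.out.println(event.get("program")); // null
  }
}
```

Note that setting the old property to null keeps it on the entity with a null value; that's enough for the "program != null" query above to come back empty, but removeProperty would drop the field altogether.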

3. Add a link to the MapReduce admin page to the dashboard (appengine-web.xml). In my case, I've mapped the MapReduce servlet to /_ah/mapreduce, a protected address

   <admin-console>
     <page name="Appstats" url="/_ah/appstats" />
     <page name="Mapreduce" url="/_ah/mapreduce/status" />
   </admin-console>
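For reference, the servlet mapping itself lives in web.xml. Here's a sketch, assuming the MapReduceServlet class shipped with the appengine-mapreduce library and an admin-only security constraint to keep the address protected (adapt the class name if your version of the library differs):

```xml
<!-- web.xml: map the bundled MapReduceServlet and restrict it to admins -->
<servlet>
  <servlet-name>mapreduce</servlet-name>
  <servlet-class>com.google.appengine.tools.mapreduce.MapReduceServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>mapreduce</servlet-name>
  <url-pattern>/_ah/mapreduce/*</url-pattern>
</servlet-mapping>
<security-constraint>
  <web-resource-collection>
    <web-resource-name>mapreduce</web-resource-name>
    <url-pattern>/_ah/mapreduce/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>admin</role-name>
  </auth-constraint>
</security-constraint>
```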

4. Run the job from the dashboard

Program to Project

Job #job_1321220718997aeac4fdbfc3f4a7cba174e4f28845724_0001

  • Status: DONE
  • Elapsed time: 00:00:23
  • Start time: Mon Nov 14 2011 10:45:19 GMT+1300 (NZDT)
  • org.apache.hadoop.mapred.Task$Counter:MAP_INPUT_RECORDS: 983 (42.21/sec avg.)

Done! I can now verify that it all worked: the same GQL query I used before returns no rows.

Conclusion and Future Improvements

MapReduce is a great tool for GAE, and very easy to use, but I'd like to use my favourite API (Siena) instead of the low-level datastore access. It shouldn't be difficult.
There's also a more efficient way of performing this task, which is also covered in Ikai's tutorial.