14 November 2011

Using MapReduce to refactor entities on GAE

I've been following Ikai Lan's good tutorial for MapReduce on App Engine (Java).

As a simple exercise, I've managed to successfully refactor some entities on the production server.
The goal is to take the existing rows of an entity called Event and rename a field from program to project.
On the App Engine dashboard I can query for them with this GQL String:
"SELECT * FROM Event WHERE program != null".
At the end of this procedure, the GQL query shouldn't return any rows.

Here are four simple steps.

1. Edit mapreduce.xml to specify my mapper class and pass the entity kind 'Event' as a parameter:

  <configuration name="Program to Project">
    <property>
      <name>mapreduce.map.class</name>
      <value template="optional">com.sirtrack.iridium.mapper.ProgramToProject</value>
    </property>
    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Event</value>
    </property>
  </configuration>

2. Implement the mapper class:

 package com.sirtrack.iridium.mapper;

 import com.google.appengine.api.datastore.DatastoreService;
 import com.google.appengine.api.datastore.DatastoreServiceFactory;
 import com.google.appengine.api.datastore.Entity;
 import com.google.appengine.api.datastore.Key;
 import com.google.appengine.tools.mapreduce.AppEngineMapper;
 import org.apache.hadoop.io.NullWritable;
 import java.util.logging.Logger;

 public class ProgramToProject extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {

   private static final Logger log = Logger.getLogger( ProgramToProject.class.getName() );
   private DatastoreService datastore;

   public ProgramToProject() {
   }

   @Override
   public void taskSetup( Context context ) {
     this.datastore = DatastoreServiceFactory.getDatastoreService();
   }

   @Override
   public void map( Key key, Entity value, Context context ) {
     log.warning( "Mapping key: " + key );
     // Copy "program" into "project", then null out the old property so the
     // GQL query "program != null" no longer matches this entity.
     if( value.hasProperty( "program" ) ) {
       Object program = value.getProperty( "program" );
       value.setProperty( "project", program );
       value.setProperty( "program", null );
       datastore.put( value );
     }
   }
 }
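Stripped of the App Engine types, the logic in map() is just a property move on a key/value bag. Here's a minimal plain-Java sketch of that transformation; the RenameSketch class, its renameProperty helper, and the sample value are hypothetical, for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

public class RenameSketch {

  // Hypothetical helper mirroring the mapper's logic: copy the old property
  // into the new one, then null out the old so a "program != null" query
  // no longer matches.
  static boolean renameProperty(Map<String, Object> props, String from, String to) {
    if (!props.containsKey(from) || props.get(from) == null) {
      return false; // nothing to migrate
    }
    props.put(to, props.get(from));
    props.put(from, null);
    return true;
  }

  public static void main(String[] args) {
    Map<String, Object> event = new HashMap<>();
    event.put("program", "sample-program"); // illustrative value only

    boolean changed = renameProperty(event, "program", "project");

    System.out.println(changed);              // true
    System.out.println(event.get("project")); // sample-program
    System.out.println(event.get("program")); // null
  }
}
```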

3. Add a link to the MapReduce admin page to the dashboard (appengine-web.xml). In my case, I've mapped the mapreduce servlet to /_ah/mapreduce, a protected address:

   <admin-console>
     <page name="Appstats" url="/_ah/appstats" />
     <page name="Mapreduce" url="/_ah/mapreduce/status" />
   </admin-console>
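For completeness, the servlet mapping behind that protected address lives in web.xml. A sketch, assuming the standard MapReduceServlet class from the appengine-mapreduce library and an admin-only security constraint:

```xml
<servlet>
  <servlet-name>mapreduce</servlet-name>
  <servlet-class>com.google.appengine.tools.mapreduce.MapReduceServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>mapreduce</servlet-name>
  <url-pattern>/_ah/mapreduce/*</url-pattern>
</servlet-mapping>

<!-- Restrict the MapReduce console to app administrators -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>mapreduce</web-resource-name>
    <url-pattern>/_ah/mapreduce/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>admin</role-name>
  </auth-constraint>
</security-constraint>
```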

4. Run the job from the dashboard:

Program to Project

Job #job_1321220718997aeac4fdbfc3f4a7cba174e4f28845724_0001

Processed items per shard (chart)


  • Status: DONE
  • Elapsed time: 00:00:23
  • Start time: Mon Nov 14 2011 10:45:19 GMT+1300 (NZDT)
  • org.apache.hadoop.mapred.Task$Counter:MAP_INPUT_RECORDS: 983 (42.21/sec avg.)

Done! I can now verify that it all worked by running the same GQL query as before: it no longer returns any rows.

Conclusion and Future Improvements

MapReduce is a great tool for GAE and very easy to use, but I'd like to use my favourite API (Siena) instead of low-level access to the datastore. It shouldn't be difficult.
There's also a more efficient way of performing this task, also covered by Ikai's tutorial.
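If I recall correctly, the more efficient approach in the tutorial is to batch datastore writes through the mapper's mutation pool rather than calling datastore.put() once per entity. The batching idea itself can be sketched in plain Java; BatchingPool here is a hypothetical stand-in for illustration, not the App Engine class:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in showing what a mutation pool does: buffer writes
// and flush them in batches to amortise per-call overhead.
public class BatchingPool<T> {
  private final int batchSize;
  private final List<T> buffer = new ArrayList<>();
  // Each inner list stands in for one batched datastore write.
  private final List<List<T>> flushedBatches = new ArrayList<>();

  public BatchingPool(int batchSize) {
    this.batchSize = batchSize;
  }

  // Queue an entity; flush automatically when the buffer fills up.
  public void put(T entity) {
    buffer.add(entity);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  // Write out whatever is buffered as a single batch.
  public void flush() {
    if (!buffer.isEmpty()) {
      flushedBatches.add(new ArrayList<>(buffer));
      buffer.clear();
    }
  }

  public List<List<T>> flushedBatches() {
    return flushedBatches;
  }
}
```

With a batch size of 3, putting 7 entities and flushing at the end yields three batched writes (3 + 3 + 1) instead of seven individual ones.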

16 February 2011

My experience with listing a property on TradeMe.co.nz

We've been listing our house for almost three weeks; here are some thoughts:

1. It's too expensive. Listing privately costs us more than $260, while an agent pays just $60.

2. The process itself is crap, and so is the UI. I can't reorder photos, for example. The edit session isn't saved (I had to re-enter everything twice); very poor.

3. No insights other than the number of views. Of the 2102 visits so far, how do I know how many were genuine visitors rather than search engine crawlers? You don't even tell me how many people have put it on their watchlist.

Come on TradeMe, you've got a monopoly now. Have you stopped innovating?