Based on Ikai Lan's tutorial: http://ikaisays.com/2010/07/09/using-the-java-mapper-framework-for-app-engine/
As a simple exercise, I've managed to successfully refactor some entities on the production server.
The goal is to take the existing entities of kind Event and rename a property from program to project.
On the App Engine dashboard I can query for the entities still carrying the old property with this GQL string:
"SELECT * FROM Event WHERE program != null".
At the end of this procedure, that query should return no results.
Here are four simple steps.
1. Edit mapreduce.xml, where I specify my mapper class and pass the entity kind 'Event' as a parameter:
<configurations>
  <configuration name="Program to Project">
    <property>
      <name>mapreduce.map.class</name>
      <value>com.sirtrack.iridium.mapper.ProgramToProject</value>
    </property>
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
    </property>
    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value>Event</value>
    </property>
  </configuration>
</configurations>
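Incidentally, this property mechanism is also how you'd pass your own parameters to a job: each name/value pair declared in mapreduce.xml ends up in the job's Hadoop Configuration, which the mapper can read back through its context. A quick illustration (the my.custom.dryrun name is hypothetical):

// Inside the mapper: read a custom <property> declared in mapreduce.xml.
String dryRun = context.getConfiguration().get( "my.custom.dryrun" );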
2. Implement the mapper class:
package com.sirtrack.iridium.mapper;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import org.apache.hadoop.io.NullWritable;
import java.util.logging.Logger;

public class ProgramToProject extends AppEngineMapper<Key, Entity, NullWritable, NullWritable>
{
  private static final Logger log = Logger.getLogger( ProgramToProject.class.getName() );

  private DatastoreService datastore;

  @Override
  public void taskSetup( Context context )
  {
    // taskSetup runs once per worker task, so one datastore service is enough.
    this.datastore = DatastoreServiceFactory.getDatastoreService();
  }

  @Override
  public void map( Key key, Entity value, Context context )
  {
    log.warning( "Mapping key: " + key );

    if( value.hasProperty( "program" ) )
    {
      // Copy the old value under the new name, then drop the old property.
      // removeProperty() deletes it outright; setting it to null would leave
      // a null-valued "program" property on the entity.
      Object program = value.getProperty( "program" );
      value.setProperty( "project", program );
      value.removeProperty( "program" );
      datastore.put( value );
    }
  }
}
3. Add a link to the MapReduce admin page to the dashboard (in appengine-web.xml). In my case, I've mapped the mapreduce servlet to /_ah/mapreduce, a protected address:
<admin-console>
  <page name="Appstats" url="/_ah/appstats" />
  <page name="Mapreduce" url="/_ah/mapreduce/status" />
</admin-console>
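For completeness, the mapping of the mapreduce servlet itself belongs in web.xml. Here's a sketch of how mine is set up, assuming the library's standard MapReduceServlet class; the security-constraint is what keeps /_ah/mapreduce admin-only:

<!-- web.xml: map the library's MapReduceServlet and restrict it to admins -->
<servlet>
  <servlet-name>mapreduce</servlet-name>
  <servlet-class>com.google.appengine.tools.mapreduce.MapReduceServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>mapreduce</servlet-name>
  <url-pattern>/_ah/mapreduce/*</url-pattern>
</servlet-mapping>
<security-constraint>
  <web-resource-collection>
    <web-resource-name>mapreduce</web-resource-name>
    <url-pattern>/_ah/mapreduce/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>admin</role-name>
  </auth-constraint>
</security-constraint>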
4. Run the job from the dashboard
Program to Project
Job #job_1321220718997aeac4fdbfc3f4a7cba174e4f28845724_0001
Processed items per shard
Overview
- DONE
- Elapsed time: 00:00:23
- Start time: Mon Nov 14 2011 10:45:19 GMT+1300 (NZDT)
Counters
- org.apache.hadoop.mapred.Task$Counter:MAP_INPUT_RECORDS: 983 (42.21/sec avg.)
Done! I can now verify that it all worked by running the same GQL query as before, which no longer returns any results.
Conclusion and Future Improvements
MapReduce is a great tool for GAE and also very easy to use, but I'd like to use my favourite API (Siena) instead of the low-level access to the datastore. It shouldn't be difficult.
There's also a more efficient way of performing this task, also covered by Ikai's tutorial: instead of calling datastore.put() once per entity, the framework provides a DatastoreMutationPool that batches writes together.
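From my reading of the tutorial, the batched version of map() would look roughly like this (DatastoreMutationPool comes from the same com.google.appengine.tools.mapreduce package; I haven't run this variant myself):

@Override
public void map( Key key, Entity value, Context context )
{
  if( value.hasProperty( "program" ) )
  {
    value.setProperty( "project", value.getProperty( "program" ) );
    value.removeProperty( "program" );

    // Hand the write to the framework's mutation pool instead of calling
    // datastore.put() synchronously; the pool flushes entities in batches.
    DatastoreMutationPool mutationPool =
        this.getAppEngineContext( context ).getMutationPool();
    mutationPool.put( value );
  }
}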