Wednesday, October 23, 2013

Text output in Map/Reduce

This evening's exercise was a spike, trying to evaluate whether I could get a significant improvement in task performance by exerting more control over the write of the data.

Conclusion: don't bother.

The job in question was a very thin one - take binary data, and write it out as human readable text.  In any significant map/reduce job, the actual mapping and reducing is going to swamp the time required to copy bytes of data around.

That said, there were interesting things to be learned in the exercise.

The actual logic of the write - the point where the data irrevocably passes from your key/value to your job output - is controlled by FileOutputFormat.getRecordWriter.  This is an abstract method; the implementation in this specific instance is provided by TextOutputFormat, which creates a TextOutputFormat.LineRecordWriter.  LineRecordWriter.write(K,V) in turn calls LineRecordWriter.writeObject.

Now the fun begins.  writeObject is specifically looking for Text; if it can't cast the Object to Text, then it's going to ensure that the UTF-8 requirement is satisfied by invoking Object.toString(), and then UTF-8 encoding the result.  In other words, TextOutputFormat is constraining the output to data that must be UTF-8 encoded.

For anything that's not a String to start with, that's going to be two copies; String.getBytes() promises to return a new byte array.


So that's not so good.

First thought would be to replace the LineRecordWriter with something useful.  TextObjectFormat is tightly coupled to its own LineRecordWriter; there are no seams to inject a replacement.  Creating a RecordWriter that delegates to LineRecordWriter doesn't do any good, since the public interface is fine.  Similarly, you can't fix the problem by extending it, because the problem method is private.


That road blocked, an alternative is to have your type extend Text.  The good news is that writeObject isn't using the Text contract, but instead two methods promised by the BinaryComparable interface.  So we extend Text (getting past the instanceof check in LineRecordWriter), implement these two methods, and we are good to go.

Until somebody tries to write out a container with our object in it.  MapWritable cares a great deal about its keys and values being Writable, and in particular supporting Writable.write(DataOutput).  So we have to replace Text.write(DataOutput) as well, as the provided implementation doesn't seem to know anything about BinaryComparable.

And having done that, things are still broken until we correctly replace all of the Read methods.

But there is another possibility: the public interface of the Text object provides methods for copying into it's own buffer.  That doesn't save you much if you insist on writing the data to your own space first.  But it does give you enough to write an OutputStream that collects data in Text.buf.  And validateUTF8 allows you to enforce the UTF-8 constraint on the streamed data (but not, alas, provide it).

No comments:

Post a Comment