Thursday, October 24, 2013

Assembing a number line

Arrange the numbers 1-17 so that adjacent numbers sum to a perfect square.

Apparently, Matt Parker (@standupmaths) tweeted this, then Megan Schmidt followed up with a report of giving the problem to her remedial high schoolers; then Dan Meyer kicked that into my feed reader, and I had to go take a look.

After picking up my notebook and working through it, I asked myself the obvious question: why 17?  And of course it has the obvious answer -- because 18 doesn't work.

Then I looked at my solution again, and discovered that 14 doesn't work either.  In fact, it seems that 15 is the first number that has a solution.

So I did the obvious thing: I shared it with the guys at work.  The boss has an interview question of the flavor "show me how you would solve..."  and so we attacked it with that idea in mind.  After confusing ourselves on the whiteboard for a bit, we began to understand that numbers of the form 2n² are a problem, because they can't pair with themselves to produce (2n)²...

It's "just" a problem of finding a Hamiltonian path through a graph where each number is uniquely represented by a node, and edges exist only if the two connected nodes sum to a perfect square.  You don't add very many edges when you add a new node to the graph, so brute forcing it in code shouldn't be too difficult.

But where's the fun in that?

Wednesday, October 23, 2013

Text output in Map/Reduce

This evening's exercise was a spike, trying to evaluate whether I could get a significant improvement in task performance by exerting more control over the write of the data.

Conclusion: don't bother.

The job in question was a very thin one - take binary data, and write it out as human readable text.  In any significant map/reduce job, the actual mapping and reducing is going to swamp the time required to copy bytes of data around.

That said, there were interesting things to be learned in the exercise.

The actual logic of the write - the point where the data irrevocably passes from your key/value to your job output - is controlled by FileOutputFormat.getRecordWriter.  This is an abstract method; the implementation in this specific instance is provided by TextOutputFormat, which creates a TextOutputFormat.LineRecordWriter.  LineRecordWriter.write(K,V) in turn calls LineRecordWriter.writeObject.

Now the fun begins.  writeObject is specifically looking for Text; if it can't cast the Object to Text, then it's going to ensure that the UTF-8 requirement is satisfied by invoking Object.toString(), and then UTF-8 encoding the result.  In other words, TextOutputFormat is constraining the output to data that must be UTF-8 encoded.

For anything that's not a String to start with, that's going to be two copies; String.getBytes() promises to return a new byte array.

So that's not so good.

First thought would be to replace the LineRecordWriter with something useful.  TextObjectFormat is tightly coupled to its own LineRecordWriter; there are no seams to inject a replacement.  Creating a RecordWriter that delegates to LineRecordWriter doesn't do any good, since the public interface is fine.  Similarly, you can't fix the problem by extending it, because the problem method is private.

That road blocked, an alternative is to have your type extend Text.  The good news is that writeObject isn't using the Text contract, but instead two methods promised by the BinaryComparable interface.  So we extend Text (getting past the instanceof check in LineRecordWriter), implement these two methods, and we are good to go.

Until somebody tries to write out a container with our object in it.  MapWritable cares a great deal about its keys and values being Writable, and in particular supporting Writable.write(DataOutput).  So we have to replace Text.write(DataOutput) as well, as the provided implementation doesn't seem to know anything about BinaryComparable.

And having done that, things are still broken until we correctly replace all of the Read methods.

But there is another possibility: the public interface of the Text object provides methods for copying into it's own buffer.  That doesn't save you much if you insist on writing the data to your own space first.  But it does give you enough to write an OutputStream that collects data in Text.buf.  And validateUTF8 allows you to enforce the UTF-8 constraint on the streamed data (but not, alas, provide it).