I've taken, with a tiny bit of prodding, Token Stream Diffs Using Pygments from a toy to a nascent toolchain that may even be almost useful.

I've brought in two new dependencies: Google's diff-match-patch and, for nice argument parsing, argparse. The diff-match-patch library provides a character-based diff algorithm and patch format (a character-based unidiff-like format with character escaping) in a number of languages, including my friend Python. I can use diff-match-patch to produce useful patch output, and to apply said patches with a simple new tokpatch.py file that is but a wrapper around diff-match-patch patching.

tokdiff.py has grown three new output formats. The original "toy" format I've renamed "verbose", and it's quite interesting for debugging and for getting an idea of why diffs look the way they do. Most useful, and the new default, is the unidiff-like output. There's also diff-match-patch's much more compact tab-delimited "delta" format, which is interesting, but which I don't think is all that safe. (It's a feature that is undocumented outside of the code itself...)
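
To give a feel for how the format switch could be wired up with argparse, here is a minimal sketch. The format names come from the list above; the --format flag, the two positional arguments, and the build_parser helper are illustrative assumptions and not necessarily what tokdiff.py actually uses.

```python
import argparse

# Format names are the ones described in the post; "unidiff" is the default.
FORMATS = ["unidiff", "verbose", "delta", "compare"]


def build_parser():
    """Build a parser for a tokdiff-like command line (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="tokdiff.py",
        description="Token-stream diff of two source files.",
    )
    # Assumed positional arguments: the two files to compare.
    parser.add_argument("old", help="path to the original file")
    parser.add_argument("new", help="path to the modified file")
    # Assumed flag name; argparse rejects any value not listed in choices.
    parser.add_argument(
        "-f", "--format",
        choices=FORMATS,
        default="unidiff",
        help="output format (default: %(default)s)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv to show the default.
args = build_parser().parse_args(["a.py", "b.py"])
print(args.format)
```

Passing something like --format delta would select the compact format instead, and an unknown format name makes argparse exit with a usage error.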

The final output format is "compare", which outputs some pretty HTML visually showing the differences between the tokenized diff approach and diff-match-patch's standard character-based diff, plus some basic benchmarking of the two algorithms.
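
The difference that "compare" visualizes can be illustrated with nothing but the standard library. This is only a sketch of the idea, not the tokdiff code: difflib.SequenceMatcher stands in for diff-match-patch, and a crude regex tokenizer (the hypothetical TOKEN_RE below) stands in for a Pygments lexer.

```python
import difflib
import re

# Crude stand-in for a lexer: identifiers, numbers, whitespace runs,
# and any other single character. Joining the tokens restores the input.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\s+|.", re.DOTALL)


def tokenize(source):
    """Split source text into a list of token strings."""
    return TOKEN_RE.findall(source)


def diff_ops(old, new):
    """Return (tag, old_piece, new_piece) for every changed region.

    Works on any two sequences, so the same function diffs strings
    (character by character) and token lists (token by token).
    """
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, old[i1:i2], new[j1:j2]))
    return ops


old_src = "total = count + 1"
new_src = "total = counter + 1"

# Character-based diff: sees only the two inserted letters.
char_ops = diff_ops(old_src, new_src)
# Token-based diff: sees one whole identifier swapped for another.
token_ops = diff_ops(tokenize(old_src), tokenize(new_src))

print(char_ops)
print(token_ops)
```

For this pair the character diff reports an inserted "er", while the token diff reports the token "count" replaced by "counter", which is the kind of contrast the HTML output is meant to make visible.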

Both tools and both dependencies can be grabbed from the darcs repository:

darcs get http://repos.worldmaker.net/tokdiff/main tokdiff

I'll consider putting together a deeper code site for it in the near future.

Some brief observations and thoughts for future directions: