Note: This guide is still very much incomplete! It is not yet, as advertised, "All Things Diffs."

This page serves as a guide to bits and pieces of information about diffs. It's not comprehensive, but it should still be useful for those working on diff parsing.

Diff File Structure

Today, the most common type of diff is the "unified diff." This is the format most people are familiar with. Despite what people think, though, there's no standard here. There are many variations on unified diffs.

Encoding

Diffs carry absolutely no information indicating the encoding of the files they describe. The content may be UTF-8, or latin1, or anything else. The diff content presented may be based on the local encoding from the environment where the diff was generated, or it may be based on the encoding used for the file in a repository. The parser has to deal with this.

Parsers therefore generally treat these files as a sequence of bytes and ignore encodings (with some exceptions: they may have to do certain things with Git filenames). This is mostly fine during the parsing stage, but if the file contents (data or filenames) are to be rendered later, then the encoding still has to be dealt with at that point.

Note that some SCMs may decide all content is UTF-8 (or another encoding).
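
As a rough illustration of the byte-oriented approach described above, here is a minimal Python sketch. It keeps the diff as bytes while splitting it into lines, and only decodes when something needs to be displayed. The function names and the fallback list of encodings are hypothetical, chosen for this example. A real tool would likely take the candidate encodings from configuration or from the SCM.

```python
from typing import List, Sequence, Tuple

# Hypothetical fallback order, for illustration only.
CANDIDATE_ENCODINGS = ("utf-8", "latin-1")


def split_diff_lines(diff: bytes) -> List[bytes]:
    """Split raw diff bytes into lines without decoding anything."""
    lines = diff.split(b"\n")

    # A trailing newline leaves one empty element at the end; drop it.
    if lines and lines[-1] == b"":
        lines.pop()

    return lines


def decode_for_display(
    data: bytes,
    encodings: Sequence[str] = CANDIDATE_ENCODINGS,
) -> Tuple[str, str]:
    """Decode bytes for rendering, trying each candidate encoding in order.

    Returns the decoded text and the name of the encoding that worked.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # Only reached if no candidate worked (latin-1 accepts any byte
    # sequence, so this matters only for custom candidate lists).
    return data.decode("utf-8", errors="replace"), "utf-8"


# Parsing works on bytes; decoding is deferred until display time.
raw = b"--- a/caf\xe9.txt\n+++ b/caf\xe9.txt\n@@ -1 +1 @@\n-old\n+new\n"
lines = split_diff_lines(raw)
text, used_encoding = decode_for_display(lines[0])
```

Here the filename bytes are not valid UTF-8, so the sketch falls back to latin-1 when decoding. Guessing like this can pick the wrong encoding for some inputs, which is why knowing the encoding from the SCM or the repository is preferable when that information is available.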

Common Diff Data

All unified diffs contain the following structure for the actual patchable data: