Code formatting is popular since gofmt and clang-format, although the are around for a long time (see cb 1979 and indent 1983). Every programming language should have such a tool, because it reduces arguments about code style. Writing such a tool is not trivial. Here is the general approach.

Actually there are two approaches. The question is: Do you parse the input (gofmt) or not (clang-format)? Parsing means you can do better layouting due to better information, but sometimes you would like to format code, which does not parse successfully.

Do Parse the Input

The overall approach is simple. Reuse the parser from the compiler to construct an AST and serialize the AST to code again. The gofmt source is only 238 lines.

Well, not so fast. A code formatter must preserve comments in the code. Consider this case:

void f(int /* comment */ /* foo */ x);

The compiler can throw away comments during lexical analysis, so the compiler AST contains no comments. Thus, it is not a simple reuse. You have to extend parser and AST for formatting.

Compared to the other approach, there is one advantage: You can do semantic analysis. For example, in a language like C:

a * b;

To layout this correctly, you need to know if a is a type or a variable. Naturally, clang-format (3.6) fails. If you have not enough context information, you can only guess.

Do NOT Parse the Input

You can reuse the lexer of the compiler, but you cannot use the parser. You work on a token stream, not on an AST.

The advantage of this approach is flexibility. You can format programs with "broken" syntax. For example, parts of a program within a diff or within the editor where the programmer has not finished typing. A normal C parser cannot handle case without switch.

case 42:
    foo(); break;

While you cannot reuse a parser library from a compiler, you have to do some parsing to distinguish between cases. For example, : requires a space in front if it occurs in a ternary operator, but not in a case declaration. You have to write a special parser for formatting, so "not parsing" was kind of a lie. The special parser must be able to start parsing at any point. You can probably assume that you start with a statement.

A* for Layouting

No matter if you parse or not, where you insert which whitespace is an important problem. Whitespace means line breaks, indentation, and spaces in between.

Your (special) parser must give you a sequence of "lines" (note the quotation marks). A "line" is actually a single line in the output, if the length is not limited. There is always a line break between "lines" and possibly within "lines". How to partition already depends on the configuration. For example, { might be a separate "line" (Allman style) or part of the previous one (K&R style). Since you usually want to limit lines to 80 characters, we need to do line breaking within "lines", which I call layouting.

Now that you have all the information you need about your tokens, the actual formatting begins. Basically, the task is to split "lines" into lines. Realize, you have to solve a complex optimization problem. You should probably read about the Knuth Plass Line Breaking Algorithm, which TeX uses. However, code formatting does have more constraints, because indentations and expressions-within matter as well. Think about how you can fix the following declaration assuming a 22 character line length limit.

void f(int x, int (f)(
       int x), int y)

You have to pursue multiple possibilities to find the best layout. Effectively, it comes to an A* search algorithm. You think like "if I split the line early here, then maybe I can split better later". Then you store both possibilities in some priority queue and continue with the cheapest variant in the queue. There are penalties for each line break, for going over the maximum line length, for splitting expressions depending on nesting, and other factors.

Provide Lots of Config Options

Since syntax layout is a highly subjective topic, there are lots of opinions (tabs vs spaces). Your tool needs lots of knobs to tune. The common use case is to have a config file at the root of a repository, then use hooks or continuous integration to enforce them.

Ok, if you are working on a new language, maybe you can enforce the One Official Style and need no configuration. Google Go did this very successfully.

© 2015-08-22