Front-end

Lexer

The lexer operates on the source code string and produces a list of LexicalToken instances (and instances of its subclasses). It is integrated into the Preprocessor; together, they handle translation phases 1-6.

Parser

The parser operates on the token list and produces an abstract syntax tree of ASTNode instances. It also performs type checking, resolves scopes and identifier names, and enforces other semantic requirements.

It is mostly a hand-written recursive descent parser, with an integrated precedence-climbing parser for expressions, which can be found in the ExpressionParser class.

Much of the parser is built out of loosely coupled classes, each of which handles part of the grammar (e.g. ParenMatcher handles everything related to matching parentheses, StatementParser parses function block statements, etc.). Each of these has an associated interface (IExpressionParser for the ExpressionParser class) that is used for delegation.

Interfaces and Delegation

Many components in the compiler are decoupled: they exist as concrete classes that implement an associated interface. Users of a component receive an instance of the concrete class, and use it only to delegate the implementation of the associated interface.

For example, the DeclarationParser needs to parse initializer expressions, so it implements the IExpressionParser interface and delegates that job to a concrete ExpressionParser instance received through its constructor (or, in some cases, via lateinit properties, to resolve cyclic dependencies).
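
As a minimal, self-contained sketch of how lateinit can break such a cyclic dependency (the class bodies below are illustrative stand-ins, not the compiler’s real code):

interface IExpressionParser {
    fun parseExpr(): String
}

interface IDeclarationParser {
    fun parseDeclaration(): String
}

class ExpressionParser : IExpressionParser {
    // The dependency cannot come through the constructor, because the
    // DeclarationParser needs an ExpressionParser first: inject it afterwards.
    lateinit var declarationParser: IDeclarationParser

    override fun parseExpr() = "expr"
}

class DeclarationParser(
    private val expressionParser: IExpressionParser
) : IDeclarationParser {
    // Initializers are expressions, so delegate to the expression parser
    override fun parseDeclaration() = "decl with initializer ${expressionParser.parseExpr()}"
}

fun main() {
    val expressionParser = ExpressionParser()
    val declarationParser = DeclarationParser(expressionParser)
    expressionParser.declarationParser = declarationParser // close the cycle
    println(declarationParser.parseDeclaration())
}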

As a code example, the largest of the components, StatementParser, is declared like this:

class StatementParser(
    declarationParser: DeclarationParser,
    controlKeywordParser: ControlKeywordParser,
    constExprParser: ConstantExprParser
) : IStatementParser,
    ITokenHandler by declarationParser,
    IScopeHandler by declarationParser,
    IParenMatcher by declarationParser,
    IExpressionParser by controlKeywordParser,
    IConstantExprParser by constExprParser,
    IDeclarationParser by declarationParser,
    IControlKeywordParser by controlKeywordParser { /* ... */ }

As the code shows, StatementParser doesn’t receive 7 separate components, even though it delegates the implementation of 7 different interfaces. Since DeclarationParser implements several of those interfaces itself, StatementParser only needs a single instance of it to delegate all of them.

This approach has several advantages:

  1. Components are forced to be written against the exposed interfaces, allowing implementation details to be hidden in the concrete classes.

  2. The individual interfaces are simple compared to a monolithic approach, where every component would have access to every other component.

  3. The dependencies between components are made explicit; for example, a ScopeHandler has no business using a ControlKeywordParser. Requiring manual delegation helps prevent accidental coupling.

  4. The delegate syntax is clean: there is usually no need to write component.doThing(); instead, doThing() can be called directly. This is most obvious in parser components using ITokenHandler, since they make many (hundreds of) calls to functions like current, eat, isEaten or tokenContext. Without delegation, they’d pollute the source with tokenHandler.current() everywhere, which is not great for readability (see the sketch after this list).
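
To make point 4 concrete, here is a small self-contained sketch of Kotlin’s interface delegation; the ITokenHandler members shown are assumptions for illustration, not the compiler’s actual signatures:

interface ITokenHandler {
    fun current(): String
    fun eat()
    fun isEaten(): Boolean
}

class TokenHandler(private val tokens: List<String>) : ITokenHandler {
    private var idx = 0
    override fun current() = tokens[idx]
    override fun eat() { idx++ }
    override fun isEaten() = idx >= tokens.size
}

// Delegation lets the parser call current()/eat()/isEaten() directly,
// with no tokenHandler. prefix anywhere in the body.
class MiniParser(tokenHandler: ITokenHandler) : ITokenHandler by tokenHandler {
    fun parseAll() {
        while (!isEaten()) {
            println("parsing ${current()}")
            eat()
        }
    }
}

fun main() {
    MiniParser(TokenHandler(listOf("int", "x", ";"))).parseAll()
}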

The ITokenHandler Interface

This interface allows a user to interact with a list of LexicalToken. It provides functions to process a token, move past it to the next one, search for the first token to meet a condition, or easily get debug data.

The terminology used in this interface’s methods relates to “eating” tokens. Eating a token means it was processed, consumed, or dealt with in some way, and further calls to the current function will return the next available token. As expected, isEaten returns true if there are no more tokens left.

It is used both in the parser and in the preprocessor.

By far the most interesting feature, however, is the tokenContext function. One of the most common operations when parsing is creating a “sub-parser” for certain nested data: the contents of a pair of parentheses in expressions, statement blocks for functions, argument lists for function calls or function prototypes, and many more. The tokenContext function takes an end index and a lambda. A sublist is created, spanning from the current token to the one specified by the end index. This is just a view into the larger list of tokens, so no array copies are made. The lambda is then executed. This is how context nesting works in an expression:

2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Root context
2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^      Outer paren's context (1st tier)
2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
              ^^^^^     ^^^^^^^^^^^       Each inner paren has its own context,
                                          but they're at the same "tier", not
                                          nested (2nd tier)
2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
                         ^^^^^            Finally, the inner-most paren has yet
                                          another context, nested 3 tiers deep

Eating tokens in nested contexts advances the parent contexts through them. In the diagram, eating the 7 * 7 tokens in the inner-most context will advance all three levels above beyond these tokens (but no further!).

The context’s magic lies in the fact that the behaviour of the ITokenHandler’s functions changes dynamically based on the token context. For example, if we’re parsing an expression such as 1 + (2 * 3 - 4) - 7, the parens get their own context. Eating tokens (via eat or eatUntil) in this context will never consume tokens beyond the contents of the parens. Inside that context, the isEaten function will return true once the 4 has been eaten, even if there are other tokens afterwards.

Most functions of the interface react to the context. As a result, parsing can be reasoned about in discrete chunks: each context deals with its own contents, and does not care what is being parsed in the parent contexts. Let’s say an error is encountered inside a context: 1 + (2 * 3 - ) - 7. There is a missing primary expression inside the parens. The expression parser notices this, and consumes all the tokens in the context. However, this does not affect the parsing of the outer expression: the interface provides no way to eat tokens beyond the ones allocated to the context, accidentally or otherwise.
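
To make this behaviour concrete, here is a self-contained toy version of the context mechanism, using plain strings instead of LexicalTokens and an assumed tokenContext signature; the compiler’s real implementation certainly differs:

class ToyTokenHandler(private val tokens: List<String>) {
    private var idx = 0 // shared by every context: eating advances all tiers at once
    private val contextEnds = ArrayDeque<Int>().apply { addLast(tokens.size) }

    fun current() = tokens[idx]
    fun eat() {
        if (!isEaten()) idx++ // can never move past the current context's end
    }
    fun isEaten() = idx >= contextEnds.last()

    fun <T> tokenContext(endIdx: Int, block: (List<String>) -> T): T {
        contextEnds.addLast(endIdx)
        // subList is a view into the token list, so no copies are made
        val result = block(tokens.subList(idx, endIdx))
        contextEnds.removeLast()
        return result
    }
}

fun main() {
    // 1 + ( 2 * 3 ) - 7, where the paren contents are indices 3 to 5
    val h = ToyTokenHandler(listOf("1", "+", "(", "2", "*", "3", ")", "-", "7"))
    repeat(3) { h.eat() } // eat "1", "+" and "("
    h.tokenContext(6) {
        while (!h.isEaten()) h.eat() // consumes "2", "*", "3", then stops
    }
    println(h.current()) // ")": the parent context advanced past the nested tokens
}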

Nested Declarator Parsing

Parsing declarators gets very complicated, very fast when nested declarators come into play. A typical declaration with nested declarators looks like:

int * (*f(int x))(double y)
^^^                         declaration specifiers
int * (*f(int x))(double y)
      ^^^^^^^^^^^           nested declarator
int * (*f(int x))(double y)
                 ^^^^^^^^^^ declarator suffix for non-nested declarator
int * (*f(int x))(double y)
         ^^^^^^^            declarator suffix for nested declarator
int * (*f(int x))(double y)
    ^                       indirection that "belongs" to the declaration specifiers
                            (from "int" to "pointer to int")
int * (*f(int x))(double y)
       ^                    indirection that "belongs" to the declarator suffix
                            (from "function" type to "pointer to function" type)
int * (*f(int x))(double y)
        ^                   designator for the resulting declaration
                            (ie the name of the function)

The example declaration declares a function called f, that takes one int parameter called x, and returns a pointer to a function that also takes one parameter, a double y, and returns a pointer to an int.

Indirection binds in reverse order of suffixes: the first indirection binds to the last suffix, and the last indirection binds to the first suffix. This reflects the declarator nesting.
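
As a toy illustration of the reverse binding for the example above (purely illustrative Kotlin, not the compiler’s declarator-parsing code):

fun main() {
    // In source order, left to right, for: int * (*f(int x))(double y)
    val indirections = listOf(
        "outer * (belongs to the declaration specifiers)",
        "inner * (belongs to a declarator suffix)"
    )
    val suffixes = listOf("(int x)", "(double y)")
    for ((i, indirection) in indirections.withIndex()) {
        // first indirection -> last suffix, last indirection -> first suffix
        val suffix = suffixes[suffixes.size - 1 - i]
        println("$indirection binds to suffix $suffix")
    }
}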

Dereferencing the int pointer returned by calling the returned function pointer, in one expression, looks like this:

int result = *(f(1)(2.0));

Yes, this is why typedefs exist.

Errors

A compiler has to deal with two classes of “errors”: those that come from the code being compiled (“diagnostics”), and those that signal issues in the compiler itself.

All the relevant code can be found in the Diagnostics.kt file.

Diagnostics

Diagnostics are handled by the DebugHandler class (and its corresponding interface, IDebugHandler).

Printed diagnostics look like this:

  someFile.c:1:27: error: Expression is not assignable [Parser|EXPRESSION_NOT_ASSIGNABLE]
  int main() {int x; (x + 2) = 5;}
                      ~~~~~  ^
  

We generate Diagnostic instances using a simple DSL, but the error messages, the source extracts, and the caret/tilde markers are created lazily, internally. They are only computed when (or if) the diagnostic gets printed. This is useful because it makes creating a Diagnostic instance relatively cheap. As a result, we can create diagnostics even if they might be discarded later, at little cost. One place where diagnostics are sometimes discarded is ConstantExprParser#evaluateExpr.
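
A sketch of this lazy-rendering idea using Kotlin’s lazy delegate; the fields below are assumptions for illustration, not the real Diagnostic’s members:

class ToyDiagnostic(
    private val file: String,
    private val line: Int,
    private val col: Int,
    private val buildMessage: () -> String
) {
    // The printed form (message formatting, source extract, caret/tilde
    // markers) is only computed the first time `printable` is read.
    val printable: String by lazy { "$file:$line:$col: error: ${buildMessage()}" }
}

fun main() {
    // Creating the instance is cheap; nothing is formatted yet...
    val diag = ToyDiagnostic("someFile.c", 1, 27) { "Expression is not assignable" }
    // ...the cost is only paid if the diagnostic is actually printed:
    println(diag.printable)
}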

An example of the IDebugHandler.diagnostic DSL, used in the parser:

// ...
diagnostic {
  id = DiagnosticId.INVALID_ARGUMENT_UNARY
  formatArgs(expr.type, op.op.s)
  errorOn(c..expr)
}
// ...

The range passed to errorOn in the example above is an implementation of the SourcedRange interface.

The rangeTo operator is overloaded on SourcedRange to easily combine multiple ranges into one, like in this example from the parser:

sizeOf..tokenAt(rParenIdx)

The sizeOf token and the token returned by tokenAt(rParenIdx) are not adjacent (think of sizeof(int)), but this overload allows the parser to trivially create a compound SourcedRange to cover the entire sizeof expression.

Since LexicalToken and ASTNode are implementations of SourcedRange, compound ranges can be created by mixing and matching tokens and AST pieces (tok..node and node..tok are both valid syntax).
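
A sketch of how such an operator overload can be declared; the startIdx/endIdx members here are assumptions (the real SourcedRange presumably also tracks source file and column data):

interface SourcedRange {
    val startIdx: Int
    val endIdx: Int
}

data class CompoundRange(override val startIdx: Int, override val endIdx: Int) : SourcedRange

// a..b produces a single range spanning from the start of a to the end of b,
// even when the two ranges are not adjacent
operator fun SourcedRange.rangeTo(other: SourcedRange): SourcedRange =
    CompoundRange(startIdx, other.endIdx)

fun main() {
    val sizeOf = CompoundRange(0, 6)   // the "sizeof" token
    val rParen = CompoundRange(10, 11) // the ")" token
    println(sizeOf..rParen)            // CompoundRange(startIdx=0, endIdx=11)
}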

Another example, used in sequentialize:

// ...
diagnostic {
  id = DiagnosticId.UNSEQUENCED_MODS
  formatArgs(variable.name)
  for (mod in modList) when (mod) {
    is BinaryExpression -> errorOn(mod.lhs)
    is IncDecOperation -> errorOn(mod)
    else -> logger.throwICE("Modification doesn't modify anything") { mod }
  }
}
// ...

This illustrates the utility of using a lambda + builder DSL: arbitrary code can run during the construction of the diagnostic, so the same diagnostic can be tailored to different situations.

Finally, an example from the preprocessor:

// ...
diagnostic {
  id = if (ignoreTrigraphs) DiagnosticId.TRIGRAPH_IGNORED else DiagnosticId.TRIGRAPH_PROCESSED
  if (!ignoreTrigraphs) formatArgs(replacement)
  columns(matchResult.start() until matchResult.end())
}
// ...

Even the kind of diagnostic can be dynamically selected based on arbitrary logic, which makes it easy to support feature flags like -fno-trigraphs in diagnostics.

Compiler Errors

For actual issues in the compiler, we use logging, the InternalCompilerError class, the throwICE extension methods on logger instances, and Kotlin’s stdlib functions from Preconditions.kt.

We use a Kotlin wrapper for slf4j’s API in common code. For the JVM, Log4j 2 is the logging backend.

A thrown ICE, IllegalArgumentException or IllegalStateException means that an invariant was violated, an “impossible” situation occurred, or an API was misused. As a result, these exceptions should not be caught anywhere: it is desirable for the application to crash if someone threw an ICE. Any one of these exceptions being thrown is an unfixed bug in the compiler.

Since the compiler is still a work in progress, many features/code paths are not yet implemented. These generally do not throw ICEs; rather, they throw the NotImplementedError created by the TODO function.
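
A sketch of how these mechanisms combine. throwICE is shown here as a top-level function to keep the example self-contained; in the compiler it is an extension on logger instances, and this signature is an assumption based on its usage in the sequentialize example above:

class InternalCompilerError(message: String, val data: Any?) : RuntimeException(message)

// Returning Nothing lets Kotlin's flow analysis treat a call as terminal,
// like the `else ->` branch in the sequentialize diagnostic
fun throwICE(message: String, data: () -> Any? = { null }): Nothing =
    throw InternalCompilerError(message, data())

fun eatTokens(count: Int) {
    // API misuse: an IllegalArgumentException via require, from Preconditions.kt
    require(count >= 0) { "Cannot eat a negative number of tokens" }
    // An "impossible" situation: an ICE, which should never be caught
    if (count > 1_000_000) throwICE("Implausible token count") { count }
    // A not-yet-implemented path: NotImplementedError via TODO
    TODO("actually eat the tokens")
}

fun main() {
    println(runCatching { eatTokens(-1) }.exceptionOrNull()) // IllegalArgumentException
    println(runCatching { eatTokens(3) }.exceptionOrNull())  // NotImplementedError
}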