Front-end
Lexer
The lexer operates on the source code string, and produces a list of LexicalTokens and its subclasses. It is integrated in the Preprocessor. Together, they deal with translation phases 1-6.
Parser
The parser operates on the token list, and produces an abstract syntax tree of ASTNode
instances.
It also performs type checks, resolves scope, identifier names and other such semantic requirements.
It is mostly a hand-written recursive descent parser, with a precedence climbing parser integrated for expressions, which can be found in the ExpressionParser class.
Much of the parser is built out of loosely coupled classes that each handle parts of parsing the grammar
(eg ParenMatcher handles everything about matching parenthesis, StatementParser parses function block
statements, etc).
Each of these has an associated interface (IExpressionParser
for the ExpressionParser
class), that is used for
delegation.
Interfaces and Delegation
Many components in the compiler are decoupled: they exist as concrete classes that implement an associated interface. Users of the components are given an instance of the concrete classes, and they use it to immediately delegate the associated interfaces.
For example, when the DeclarationParser
needs to parse an initializer
expression, it implements the IExpressionParser
interface, and delegates that
job to a concrete ExpressionParser
instance received through the constructor
(or, in some cases, via lateinit
properties, to resolve cyclic dependencies).
As a code example, the largest of the components, StatementParser
, is declared like this:
class StatementParser(
declarationParser: DeclarationParser,
controlKeywordParser: ControlKeywordParser,
constExprParser: ConstantExprParser
) : IStatementParser,
ITokenHandler by declarationParser,
IScopeHandler by declarationParser,
IParenMatcher by declarationParser,
IExpressionParser by controlKeywordParser,
IConstantExprParser by constExprParser,
IDeclarationParser by declarationParser,
IControlKeywordParser by controlKeywordParser { /* ... */ }
As the code shows, StatementParser
doesn’t receive 7 separate components, even
though it delegates the implementation of 7 different interfaces.
Since the DeclarationParser
implements some of those interfaces itself,
StatementParser
only needs an instance of that for all of them.
This approach has several advantages:
Components are forced to be written against the exposed interfaces, allowing implementation details to be hidden in the concrete classes.
The individual interfaces are simple compared to a monolith approach, where every component would have access to every other component.
The dependencies between components are made explicit; for example, a
ScopeHandler
has no business using aControlKeywordParser
. Requiring manual delegation helps prevent accidental coupling.The delegate syntax is clean: there is usually no need to write
component.doThing()
, ratherdoThing()
can be called directly. This is most obvious in parser components usingITokenHandler
, since they have lots (hundreds) of calls to functions likecurrent
,eat
,isEaten
ortokenContext
. Without delegation, they’d end up polluting the source withtokenHandler.current()
everywhere, which is not great for readability.
The ITokenHandler
Interface
This interface allows a user to interact with a list of LexicalToken
. It
provides functions to process a token, move past it to the next one, search for
the first token to meet a condition, or easily get debug data.
The terminology used in this interface’s methods relates to “eating” tokens.
Eating a token means it was processed, consumed, or dealt with in some way, and
further calls to the current
function will return the next available token. As
expected, isEaten
returns true if there are no more tokens left.
It is used both in the parser, and in the preprocessor.
By far the most interesting feature, however, is the tokenContext
function.
One of the most common operations when parsing is having to create a
“sub-parser” for certain nested data: the contents of a set of parenthesis in
expressions, statement blocks for functions, argument lists for function calls
or function prototypes, and many more.
The tokenContext
function takes an end index, and a lambda. A sublist is
created, including the tokens from the current one, to the one specified by the
end index. This is just a view into the larger list of tokens, so no array
copies are made. The lambda is then executed.
This is how context nesting works in an expression:
2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Root context
2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Outer paren's context (1st tier)
2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
^^^^^ ^^^^^^^^^^^ Each inner paren has its own context,
but they're at the same "tier", not
nested (2nd tier)
2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5
^^^^^ Finally, the inner-most paren has yet
another context, nested 3 tiers deep
Eating tokens in nested contexts advances the parent contexts through them. On
the diagram, eating the 7 * 7
tokens in the inner-most context will advance
all the 3 levels above beyond these tokens (but no further!).
The context’s magic lies in the fact that the behaviour of the ITokenHandler
’s
functions is dynamic based on the token context. For example, if we’re parsing
an expression such as 1 + (2 * 3 - 4) - 7
, the parens will get their own
context. Eating tokens (via eat
or eatUntil
) in this context will never eat
tokens beyond the contents of the parens. In context, the isEaten
function
will return true after 4
was eaten, even if there are other tokens afterwards.
Most functions of the interface react to the context. As a result, parsing can
be reasoned about in discrete chunks: each context deals with its own contents,
and does not care what is being parsed in the parent contexts. Let’s say an
error is encountered inside a context: 1 + (2 * 3 - ) - 7
. There is a missing
primary expression inside the paren. The expression parser notices, and consumes
all the tokens in the context. However, this does not affect the outer
expression’s: the interface provides no way to eat tokens beyond the ones
allocated to the context, accidentally or otherwise.
Nested Declarator Parsing
Parsing declarators gets very complicated, very fast when nested declarators come into play. A typical declaration with nested declarators looks like:
int * (*f(int x))(double y)
^^^ declaration specifiers
int * (*f(int x))(double y)
^^^^^^^^^^^ nested declarator
int * (*f(int x))(double y)
^^^^^^^^^^ declarator suffix for non-nested declarator
int * (*f(int x))(double y)
^^^^^^^ declarator suffix for nested declarator
int * (*f(int x))(double y)
^ indirection that "belongs" to the declaration specifiers
(from "int" to "pointer to int")
int * (*f(int x))(double y)
^ indirection that "belongs" to the declarator suffix
(from "function" type to "pointer to function" type)
int * (*f(int x))(double y)
^ designator for the resulting declaration
(ie the name of the function)
The example declaration declares a function called f
, that takes one int
parameter called x
, and returns a pointer to a function that also takes one
parameter, a double y
, and returns a pointer to an int.
Indirection binds in reverse order of suffixes: the first indirection binds to the last suffix, and the last indirection binds to the first suffix. This reflects the declarator nesting.
Dereferencing the int pointer returned by calling the returned function pointer, in one expression, looks like this:
int result = *(f(1)(2.0));
Yes, this is why typedefs exist.
Errors
A compiler has to deal with two classes of “errors”; those that come from the code being compiled (“diagnostics”), and those that signal issues in the compiler itself.
All the relevant code can be found in the Diagnostics.kt file.
Diagnostics
Diagnostics are handled by the DebugHandler
class (and its corresponding interface, IDebugHandler
).
Printed diagnostics look like this:
someFile.c:1:27: error: Expression is not assignable [Parser|EXPRESSION_NOT_ASSIGNABLE]
int main() {int x; (x + 2) = 5;}
~~~~~ ^
We generate Diagnostic
instances using a simple DSL, but the error messages,
the source extracts, and the caret/tilde markers are created lazily, internally.
They are only computed when (or if) the diagnostic gets printed. This is useful,
because it makes creating a Diagnostic
instance relatively cheap. As a result,
we can create diagnostics even if they might be discarded later, with little
cost. One such place where diagnostics are sometimes discarded is
ConstantExprParser#evaluateExpr
.
An example of the IDebugHandler.diagnostic
DSL, used in the parser:
// ...
diagnostic {
id = DiagnosticId.INVALID_ARGUMENT_UNARY
formatArgs(expr.type, op.op.s)
errorOn(c..expr)
}
// ...
The range passed to errorOn
in the example above, is an implementor of the SourcedRange
interface.
The rangeTo
operator is overloaded on SourcedRange
to easily combine multiple ranges into one, like in this example
from the parser:
sizeOf..tokenAt(rParenIdx)
The sizeOf
token and the token returned by tokenAt(rParenIdx)
are not
adjacent (think of sizeof(int)
), but this overload allows the parser to
trivially create a compound SourcedRange
to cover the entire sizeof
expression.
Since LexicalToken
and ASTNode
are implementations of SourcedRange
,
compound ranges can be created by mixing and matching tokens and AST pieces
(tok..node
and node..tok
are both valid syntax).
Another example, used in sequentialize
:
// ...
diagnostic {
id = DiagnosticId.UNSEQUENCED_MODS
formatArgs(variable.name)
for (mod in modList) when (mod) {
is BinaryExpression -> errorOn(mod.lhs)
is IncDecOperation -> errorOn(mod)
else -> logger.throwICE("Modification doesn't modify anything") { mod }
}
}
// ...
This illustrates the utility provided by using a lambda + builder DSL. Arbitrary code can run in the construction of the diagnostic, so the same diagnostic can be tailored to different situations.
Finally, an example from the preprocessor:
// ...
diagnostic {
id = if (ignoreTrigraphs) DiagnosticId.TRIGRAPH_IGNORED else DiagnosticId.TRIGRAPH_PROCESSED
if (!ignoreTrigraphs) formatArgs(replacement)
columns(matchResult.start() until matchResult.end())
}
// ...
Even the kind of diagnostic can be dynamically selected based on arbitrary
logic, which makes it easy to support feature flags like -fno-trigraphs
in
diagnostics.
Compiler Errors
For actual issues in the compiler, we use logging, the
InternalCompilerError
class, the throwICE
extension methods on logger
instances, and Kotlin’s stdlib functions from Preconditions.kt
.
We use a Kotlin wrapper for slf4j’s API in common code. For the JVM, Log4j 2 is the logging backend.
Instances of ICE, IllegalArgumentException
or IllegalStateException
being thrown means
invariants were violated, “impossible” situations occurred, or misuse of an API
was encountered. As a result, these exceptions should not be caught anywhere: it
is desirable for the application to crash if someone threw an ICE. Any one of
these exceptions being thrown is an unfixed bug in the compiler.
Since the compiler is still a work in progress, there are many features/code
paths that are not yet implemented. They generally do not throw ICEs, rather
they use NotImplementedError
created by the TODO
function.