.. include:: _config.rst Front-end ========= Lexer ----- The `lexer`_ operates on the source code string, and produces a list of `LexicalTokens`_ and its subclasses. It is integrated in the `Preprocessor`_. Together, they deal with translation phases 1-6. .. _lexer: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/lexer/LexicalElements.kt .. _LexicalTokens: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/lexer/TokenModel.kt .. _Preprocessor: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/lexer/Preprocessor.kt Parser ------ The `parser`_ operates on the token list, and produces an `abstract syntax tree`_ of `ASTNode` instances. It also performs type checks, resolves scope, identifier names and other such semantic requirements. It is mostly a hand-written recursive descent parser, with a `precedence climbing`_ parser integrated for expressions, which can be found in the `ExpressionParser`_ class. Much of the parser is built out of loosely coupled classes that each handle parts of parsing the grammar (eg `ParenMatcher`_ handles everything about matching parenthesis, `StatementParser`_ parses function block statements, etc). Each of these has an associated interface (`IExpressionParser` for the `ExpressionParser` class), that is used for delegation. .. _parser: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/parser/ .. _abstract syntax tree: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/parser/SyntaxTreeModel.kt .. _precedence climbing: https://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method .. _ExpressionParser: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/parser/ExpressionParser.kt .. _ParenMatcher: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/parser/ParenMatcher.kt .. _StatementParser: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/parser/StatementParser.kt Interfaces and Delegation ^^^^^^^^^^^^^^^^^^^^^^^^^ Many components in the compiler are decoupled: they exist as concrete classes that implement an associated interface. Users of the components are given an instance of the concrete classes, and they use it to immediately delegate the associated interfaces. For example, when the `DeclarationParser` needs to parse an initializer expression, it implements the `IExpressionParser` interface, and delegates that job to a concrete `ExpressionParser` instance received through the constructor (or, in some cases, via `lateinit` properties, to resolve cyclic dependencies). As a code example, the largest of the components, `StatementParser`, is declared like this: .. code-block:: kotlin class StatementParser( declarationParser: DeclarationParser, controlKeywordParser: ControlKeywordParser, constExprParser: ConstantExprParser ) : IStatementParser, ITokenHandler by declarationParser, IScopeHandler by declarationParser, IParenMatcher by declarationParser, IExpressionParser by controlKeywordParser, IConstantExprParser by constExprParser, IDeclarationParser by declarationParser, IControlKeywordParser by controlKeywordParser { /* ... */ } As the code shows, `StatementParser` doesn't receive 7 separate components, even though it delegates the implementation of 7 different interfaces. Since the `DeclarationParser` implements some of those interfaces itself, `StatementParser` only needs an instance of that for all of them. This approach has several advantages: 1. Components are forced to be written against the exposed interfaces, allowing implementation details to be hidden in the concrete classes. 2. The individual interfaces are simple compared to a monolith approach, where every component would have access to every other component. 3. The dependencies between components are made explicit; for example, a `ScopeHandler` has no business using a `ControlKeywordParser`. Requiring manual delegation helps prevent accidental coupling. 4. The delegate syntax is clean: there is usually no need to write `component.doThing()`, rather `doThing()` can be called directly. This is most obvious in parser components using `ITokenHandler`, since they have lots (hundreds) of calls to functions like `current`, `eat`, `isEaten` or `tokenContext`. Without delegation, they'd end up polluting the source with `tokenHandler.current()` everywhere, which is not great for readability. The `ITokenHandler` Interface ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This interface allows a user to interact with a list of `LexicalToken`. It provides functions to process a token, move past it to the next one, search for the first token to meet a condition, or easily get debug data. The terminology used in this interface's methods relates to "eating" tokens. Eating a token means it was processed, consumed, or dealt with in some way, and further calls to the `current` function will return the next available token. As expected, `isEaten` returns true if there are no more tokens left. It is used both in the parser, and in the preprocessor. By far the most interesting feature, however, is the `tokenContext` function. One of the most common operations when parsing is having to create a "sub-parser" for certain nested data: the contents of a set of parenthesis in expressions, statement blocks for functions, argument lists for function calls or function prototypes, and many more. The `tokenContext` function takes an end index, and a lambda. A sublist is created, including the tokens from the current one, to the one specified by the end index. This is just a view into the larger list of tokens, so no array copies are made. The lambda is then executed. This is how context nesting works in an expression: .. code-block:: text 2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Root context 2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Outer paren's context (1st tier) 2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5 ^^^^^ ^^^^^^^^^^^ Each inner paren has its own context, but they're at the same "tier", not nested (2nd tier) 2 + (6 - 3 * (2 + 2) - ((7 * 7) / 2)) * 5 ^^^^^ Finally, the inner-most paren has yet another context, nested 3 tiers deep Eating tokens in nested contexts advances the parent contexts through them. On the diagram, eating the `7 * 7` tokens in the inner-most context will advance all the 3 levels above beyond these tokens (but no further!). The context's magic lies in the fact that the behaviour of the `ITokenHandler`'s functions is dynamic based on the token context. For example, if we're parsing an expression such as `1 + (2 * 3 - 4) - 7`, the parens will get their own context. Eating tokens (via `eat` or `eatUntil`) in this context will never eat tokens beyond the contents of the parens. In context, the `isEaten` function will return true after `4` was eaten, even if there are other tokens afterwards. Most functions of the interface react to the context. As a result, parsing can be reasoned about in discrete chunks: each context deals with its own contents, and does not care what is being parsed in the parent contexts. Let's say an error is encountered inside a context: `1 + (2 * 3 - ) - 7`. There is a missing primary expression inside the paren. The expression parser notices, and consumes all the tokens in the context. However, this does not affect the outer expression's: the interface provides no way to eat tokens beyond the ones allocated to the context, accidentally or otherwise. Nested Declarator Parsing ^^^^^^^^^^^^^^^^^^^^^^^^^ Parsing declarators gets very complicated, very fast when nested declarators come into play. A typical declaration with nested declarators looks like: .. code-block:: text int * (*f(int x))(double y) ^^^ declaration specifiers int * (*f(int x))(double y) ^^^^^^^^^^^ nested declarator int * (*f(int x))(double y) ^^^^^^^^^^ declarator suffix for non-nested declarator int * (*f(int x))(double y) ^^^^^^^ declarator suffix for nested declarator int * (*f(int x))(double y) ^ indirection that "belongs" to the declaration specifiers (from "int" to "pointer to int") int * (*f(int x))(double y) ^ indirection that "belongs" to the declarator suffix (from "function" type to "pointer to function" type) int * (*f(int x))(double y) ^ designator for the resulting declaration (ie the name of the function) The example declaration declares a function called `f`, that takes one int parameter called `x`, and returns a pointer to a function that also takes one parameter, a double `y`, and returns a pointer to an int. Indirection binds in reverse order of suffixes: the first indirection binds to the last suffix, and the last indirection binds to the first suffix. This reflects the declarator nesting. Dereferencing the int pointer returned by calling the returned function pointer, in one expression, looks like this: .. code-block:: text int result = *(f(1)(2.0)); Yes, this is why typedefs exist. Errors ------ A compiler has to deal with two classes of "errors"; those that come from the code being compiled ("diagnostics"), and those that signal issues in the compiler itself. All the relevant code can be found in the `Diagnostics.kt`_ file. .. _Diagnostics.kt: https://github.com/slak44/ckompiler/tree/master/src/main/kotlin/slak/ckompiler/Diagnostics.kt Diagnostics ^^^^^^^^^^^ Diagnostics are handled by the `DebugHandler` class (and its corresponding interface, `IDebugHandler`). Printed diagnostics look like this: .. raw:: html
    someFile.c:1:27: error: Expression is not assignable [Parser|EXPRESSION_NOT_ASSIGNABLE]
    int main() {int x; (x + 2) = 5;}
                        ~~~~~  ^
    
We generate `Diagnostic` instances using a simple DSL, but the error messages, the source extracts, and the caret/tilde markers are created lazily, internally. They are only computed when (or if) the diagnostic gets printed. This is useful, because it makes creating a `Diagnostic` instance relatively cheap. As a result, we can create diagnostics even if they might be discarded later, with little cost. One such place where diagnostics are sometimes discarded is `ConstantExprParser#evaluateExpr`. An example of the `IDebugHandler.diagnostic` DSL, used in the parser: .. code-block:: kotlin // ... diagnostic { id = DiagnosticId.INVALID_ARGUMENT_UNARY formatArgs(expr.type, op.op.s) errorOn(c..expr) } // ... The range passed to `errorOn` in the example above, is an implementor of the `SourcedRange` interface. The `rangeTo` operator is overloaded on `SourcedRange` to easily combine multiple ranges into one, like in this example from the parser: .. code-block:: kotlin sizeOf..tokenAt(rParenIdx) The `sizeOf` token and the token returned by `tokenAt(rParenIdx)` are not adjacent (think of `sizeof(int)`), but this overload allows the parser to trivially create a compound `SourcedRange` to cover the entire sizeof expression. Since `LexicalToken` and `ASTNode` are implementations of `SourcedRange`, compound ranges can be created by mixing and matching tokens and AST pieces (`tok..node` and `node..tok` are both valid syntax). Another example, used in `sequentialize`: .. code-block:: kotlin // ... diagnostic { id = DiagnosticId.UNSEQUENCED_MODS formatArgs(variable.name) for (mod in modList) when (mod) { is BinaryExpression -> errorOn(mod.lhs) is IncDecOperation -> errorOn(mod) else -> logger.throwICE("Modification doesn't modify anything") { mod } } } // ... This illustrates the utility provided by using a lambda + builder DSL. Arbitrary code can run in the construction of the diagnostic, so the same diagnostic can be tailored to different situations. Finally, an example from the preprocessor: .. code-block:: kotlin // ... diagnostic { id = if (ignoreTrigraphs) DiagnosticId.TRIGRAPH_IGNORED else DiagnosticId.TRIGRAPH_PROCESSED if (!ignoreTrigraphs) formatArgs(replacement) columns(matchResult.start() until matchResult.end()) } // ... Even the kind of diagnostic can be dynamically selected based on arbitrary logic, which makes it easy to support feature flags like `-fno-trigraphs` in diagnostics. Compiler Errors ^^^^^^^^^^^^^^^ For actual issues in the compiler, we use logging, the `InternalCompilerError` class, the `throwICE` extension methods on logger instances, and Kotlin's stdlib functions from `Preconditions.kt`. We use `a Kotlin wrapper `_ for slf4j's API in common code. For the JVM, `Log4j 2 `_ is the logging backend. Instances of ICE, `IllegalArgumentException` or `IllegalStateException` being thrown means invariants were violated, "impossible" situations occurred, or misuse of an API was encountered. As a result, these exceptions should not be caught anywhere: it is desirable for the application to crash if someone threw an ICE. Any one of these exceptions being thrown is an unfixed bug in the compiler. Since the compiler is still a work in progress, there are many features/code paths that are not yet implemented. They generally do not throw ICEs, rather they use `NotImplementedError` created by the `TODO` function.