First off, as with my answer to your previous question, I last worked on this code in 2012 so my memory is likely faulty. I’ll make a best effort though.
We called the incremental parse tree rebuilding algorithm “the blender” because it blends together a new parse with the old one, but also because it makes my brain feel like it’s in a blender whenever I try to figure out how it works. This algorithm was based on Neal Gafter’s PhD research, so it’s a good thing we had Neal on the team at the time. :)
The first thing to note is that the parse tree we hand out via the Roslyn API when you ask for one is NOT the parse tree generated by the parser/blender. It’s a thin wrapper over that parse tree. We called these “red/green trees” because those were the colors of the whiteboard markers we had on hand when designing the algorithm.
The parser actually generates “green” trees. Green trees are immutable persistent trees. Immutable because, obviously, you can’t change them. Persistent because if you have some fragment of a green tree, you can re-use it to make a different green tree.
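To illustrate persistence, here’s a minimal sketch; these are hypothetical types of my own invention, not Roslyn’s actual green node implementation:

sealed class GreenNode
{
    public readonly string Kind;
    public readonly GreenNode[] Children;
    public GreenNode(string kind, params GreenNode[] children)
    {
        Kind = kind;
        Children = children;
    }
    // "Editing" a green tree means building a new node; every child we
    // did not touch is shared, by reference, between old and new trees.
    public GreenNode WithChild(int index, GreenNode newChild)
    {
        var children = (GreenNode[])Children.Clone();
        children[index] = newChild;
        return new GreenNode(Kind, children);
    }
}

Replace one statement in a thousand-statement method and the other 999 green statement nodes are reused as-is in the new tree.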
Green trees are concrete syntax trees, not abstract syntax trees. The exact source code can be entirely deduced from the contents of the syntax tree because each node holds on to the tokens that it was parsed from, and those tokens hang on to their “trivia” – their leading and trailing whitespace and comments. This is important for two reasons. The first is that refactoring tools, linters, code colorizers, code formatters, and so on, often need to know the details of the trivia. The second has to do with how we manage persistence.
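You can see that full fidelity through the public API, which is real and stable even if my memory of the internals is not: round-tripping a parsed tree reproduces the original text exactly, trivia and all.

using System.Diagnostics;
using Microsoft.CodeAnalysis.CSharp;

static class RoundTrip
{
    static void Main()
    {
        var source = "class C   { /* odd spacing, a comment */  }\r\n";
        var tree = CSharpSyntaxTree.ParseText(source);
        // ToFullString() includes all leading and trailing trivia, so the
        // original text comes back character-for-character.
        Debug.Assert(tree.GetRoot().ToFullString() == source);
    }
}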
A consequence of persistence – re-usability – is that nodes in the green tree must not know where in the source code they came from! If every node in a green tree knows its exact position in the original source code, then making an edit from
class C {
... a thousand lines of code here
}
to
class C
{
... a thousand lines of code here
}
would have to rebuild the entire parse tree after the C, because the position of every node after the inserted newline changes, even though nothing of consequence changed.
What we do instead is this: the “green” lexer and parser keep track of the width of every token and parse tree node, but not its position. A node’s position can be deduced by adding up the widths of everything that comes before it, and that’s exactly what the red tree does. The red tree is a thin, allocated-on-demand wrapper over the green tree, and it does the position calculations.
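As a rough sketch of that division of labor (hypothetical types again, not the real implementation): the green node stores only its width, and the red wrapper computes positions on demand by summing the widths of the siblings that come before it.

sealed class Green
{
    public readonly int Width;              // characters covered, trivia included
    public readonly Green[] Children;
    public Green(int width, params Green[] children)
    {
        Width = width;
        Children = children;
    }
}

sealed class Red
{
    public readonly Green Node;
    public readonly int Position;           // absolute start, computed on demand
    public Red(Green node, int position)
    {
        Node = node;
        Position = position;
    }
    public Red GetChild(int index)
    {
        // A child's position is our position plus the widths of the
        // siblings before it; the green tree stores no positions at all.
        int position = Position;
        for (int i = 0; i < index; i++)
            position += Node.Children[i].Width;
        return new Red(Node.Children[index], position);
    }
}

In the class C example above, moving the brace to its own line changes the widths along one spine of the tree; every other green node can be reused because none of them stored a position.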
So that’s the key insight: the choice that enables incremental lexing and parsing is to make the lex and parse state “green”, meaning immutable, full-fidelity with respect to the source code, and tracking widths rather than positions.
Once you’ve got that, the blender algorithm is, at a very high level:
- Lex and parse the whole thing to get a baseline.
- On each edit, Visual Studio sends Roslyn a data structure that represents the edit – which file changed, where in the file the change was, and so on.
- From the width data in the green tokens, the blender can figure out which tokens in the stream were definitely inside the edited region and which tokens might have been affected by it. (If the edit was a deletion, say, we might have just joined together two blobs of text that used to be two tokens but are now one.) There’s a sketch of this bookkeeping after the list.
- Now we know which portions of the token stream are “dirty” and what portions of the source code they correspond to, so we can reset the lexer state to just before the dirty spot, lex only as much of the file as we need to, and patch a new token stream in. The lexer doesn’t know or care; it’s just being asked to lex some char sequence like it always does.
- Now we have the same problem but for the parser. We now know the diff between the old and new token streams and can use the width data to deduce which portions of the parse tree are dirty and which are not.
- The parser is asked to re-parse only the portions of the token stream that it needs to. Like the lexer, the recursive descent parser doesn’t care that the blender is going to do stuff to the parse tree it returns. The parser just does what it says on the tin: you give it a position in a token stream and ask it to parse a class definition or whatever, and it does that and hands back a green tree.
- The blender then rebuilds the spine of the parse tree with the new green tree replacing the old one.
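To make the width bookkeeping concrete, here’s a hand-wavy sketch of the “which tokens are dirty” step from the bullets above. The types and the widen-by-one rule are illustrative assumptions on my part; the real Blender.Reader logic is considerably more subtle.

using System;

struct Edit
{
    public int Start;        // offset of the change in the old text
    public int OldLength;    // characters removed
    public int NewLength;    // characters inserted
}

static class DirtyTokens
{
    // Given the widths of the old token stream, find the range of
    // tokens that must be relexed to incorporate the edit.
    public static (int first, int last) Find(int[] widths, Edit edit)
    {
        // Treat a pure insertion as touching the character at the edit point.
        int oldEnd = edit.Start + Math.Max(1, edit.OldLength);
        int first = -1, last = -1, position = 0;
        for (int i = 0; i < widths.Length; i++)
        {
            int end = position + widths[i];
            if (end > edit.Start && position < oldEnd) // token overlaps the edit
            {
                if (first < 0) first = i;
                last = i;
            }
            position = end;
        }
        if (first < 0)                                 // edit at end of file
            first = last = widths.Length - 1;
        // Widen by a token on each side: a deletion can glue two neighboring
        // tokens into one, and lexer lookahead means a token's interpretation
        // can depend on its neighbors.
        return (Math.Max(0, first - 1), Math.Min(widths.Length - 1, last + 1));
    }
}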
The blender API source code is here:
https://github.com/dotnet/roslyn/blob/main/src/Compilers/CSharp/Portable/Parser/Blender.cs
But the real work is done in the “reader”, which produces a sequence of “blended” tokens or parse tree nodes, as requested:
https://github.com/dotnet/roslyn/blob/main/src/Compilers/CSharp/Portable/Parser/Blender.Reader.cs
One of the first tools I wrote for myself after Neal got the blender working was a WinForms app that displayed the green tree state before and after an edit so that I could make sense of what the blender was doing. It really does make my head hurt, so best of luck trying to understand it!