Programming is the best antidote to arrogance I’ve come across — I make so many errors that I am continually reminded of my own fallibility. Broadly speaking, I think of errors as severe or minor. Severe errors are where I have fundamentally misunderstood something about the system I am creating. Each severe error is a bespoke problem, often requiring custom tooling to understand and fix it. Minor errors, in contrast, are repetitive and quickly fixed. However, they’re also much more numerous than severe errors: even shaving a couple of seconds off of the time it takes a programmer to fix a class of minor errors is worthwhile when you consider how often they occur.
The most minor of minor errors, and also I suspect the most frequent, are syntax errors. They occur for three main reasons: mental sloppiness; inaccurate typing [1]; or an incomplete understanding of the language’s syntax. The latter case is generally part of a brief-ish learning phase we go through and I’m not sure what a good solution for it might look like. The former two cases, however, are extremely common. When I’ve made a small typo, what I want is the parser in my compiler or IDE to pinpoint the location of the syntax error accurately and then recover from it and continue as if I hadn’t made an error at all. Since compilation is often far from instantaneous, and I often make multiple errors (not just syntax errors), good quality syntax error recovery improves my programming efficiency.
Unfortunately, LR parsers – of which I am particularly fond – have a poor reputation for syntax error recovery. I’m going to show in this article that this isn’t inevitable, and that it’s possible to do surprisingly good automatic syntax error recovery for any LR grammar. If you want to know more details, you might be interested in the paper Lukas Diekmann and I recently published called Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers. The paper also has a fairly brief accompanying talk, if you find that sort of thing helpful:
For everyone else, let’s continue. To make our lives easier, for the rest of this article I’m going to shorten “syntax error recovery” to “error recovery”.
Outlining the problem
Let’s see error recovery in action in a widely used compiler — javac:
[Animation]
As a quick reminder, ‘int x y;’ isn’t valid Java syntax. javac correctly detects a syntax error after the ‘x’ token and then tries to recover from that error. Since ‘int x;’ is valid Java, javac assumes that I meant to put a semi-colon after ‘x’, repairs my input accordingly, and continues parsing. This is the good side of error recovery: my silly syntax error hasn’t stopped the compiler from carrying on its good work. However, the bad side of error recovery is immediately apparent: ’ y;’ isn’t valid Java, so javac immediately prints out a second, spurious, syntax error that isn’t any fault of mine.
Of course, I have deliberately picked an example where javac does a poor job but I regret to inform you that it didn’t take me very long to find it. Many parsers do such a poor job of error recovery that experienced programmers often scroll back to the location of the first syntax error, ignoring both its repair and any subsequent syntax errors. Instead of being helpful, error recovery can easily have the opposite effect, slowing us down as we look for the real error amongst a slew of spurious errors.
Let’s look at the modern compiler that I most often use as an exemplar of good error messages, rustc. It often does a good job in the face of syntax errors:
[Animation]
However, even rustc can be tripped up when presented with simple syntax errors:
[Animation]
Some language implementations don’t even bother trying to recover from syntax errors. For example, even if I make two easily fixable syntax errors in a file, CPython stops as soon as it encounters the first:
[Animation]
As all this might suggest, error recovery is hard to do well, and it’s unlikely that it will ever fully match human intuition about how syntax errors should be fixed. The root of the problem is that when we hit an error while parsing, there are, in general, an infinite number of ways that we could take to try and get parsing back on track. Since exploring infinity takes a while, error recovery has to use heuristics of one sort or another.
The more knowledge a parser has of the language it is parsing, the more refined that heuristic can be. Hand-written parsers have a fundamental advantage here, because one can add as much knowledge about the language’s semantics as one wants to the parser. However, extending a hand-written parser in this way is no small task, especially for languages with large grammars. It’s difficult to get precise figures, but I’ve seen more than one parser that has taken a small number of person years of effort, much of which is devoted to error recovery. Not many of us have that much time to devote to the problem.
Automatically generated parsers, in contrast, are at a clear disadvantage: their only knowledge of the language is that expressed via its grammar. Despite that, automatically generated LL parsers are often able to do a tolerable job of error recovery [2].
Unfortunately, LR parsers have a not undeserved reputation for doing a poor job of error recovery. Yacc, for example, requires users to sprinkle error tokens throughout their grammar in order to have error recovery in the resulting parser: I think I’ve only seen one real grammar which makes use of this feature, and I am sceptical that it can be made to work well. Panic mode is a fully automatic approach to error recovery in LR parsing, but it works by gradually deleting the parsing stack, causing it to delete input before a syntax error in order to try and recover after it. Frankly, panic mode’s repairs are so bad that I think on a modern machine [3] it’s worse than having no error recovery at all.
The roots of a solution
At a certain point when working on grmtools I realised that I should think about error recovery, something which I had only ever encountered as a normal user. A quick word about grmtools: it’s intended as a collection of parsing related libraries in Rust. At the moment, the parts that are most useful to users are lrpar – a Yacc-compatible parser – and, to a lesser extent [4], lrlex – a Lex-ish lexer. For the rest of this article, I’ll almost exclusively be talking about lrpar, as that’s the part concerned with error recovery.
Fortunately for me, I quickly came across Carl Cerecke’s PhD thesis which opened my eyes to an entirely different way of doing error recovery. His thesis rewards careful reading, and has some very good ideas in it. Ultimately I realised that Cerecke’s thesis is a member of what these days I call the Fischer et al. family of error recovery algorithms for LR parsing, since they all trace their lineage back to that paper.
When error recovery algorithms in the Fischer et al. family encounter a syntax error they try to find a repair sequence that, when applied to the input, gets parsing back on track. Different algorithms have different repairs at their disposal and different mechanisms for creating a repair sequence. For example, we ended up using the approach of Corchuelo et al. — one of the most recent members of the Fischer et al. family — as our intellectual base.
CPCT+ in use
We took the Corchuelo et al. algorithm, fixed and semi-formalised it [5], and extended it to produce a new error recovery algorithm CPCT+ that is now part of lrpar. We can use nimbleparse — a simple command-line grammar debugging tool — to see CPCT+ in action on our original Java example:
[Animation]
As with javac’s error recovery, CPCT+ is started when lrpar encounters a syntax error at the ‘y’ token. Unlike javac, CPCT+ presents 3 different repair sequences (numbered 1, 2, 3) to the user which, in order [6], would repair the input to be equivalent to: ‘int x, y;’, ‘int x = y;’, or ‘int x;’. Importantly, repair sequences can contain multiple repairs:
[Animation]
Since you probably don’t want to watch the animation endlessly I’ll put the repair sequences that are reported here:
Parsing error at line 2 column 13. Repair sequences found:
1: Insert ), Shift {, Delete if, Delete true
2: Insert ), Shift {, Shift if, Insert (, Shift true, Insert )
This example shows all of the individual repair types that CPCT+ can generate:
- ‘Insert x’ means ‘insert a token x at the current position’;
- ‘Shift x’ means ‘keep the token x at the current position unchanged and advance the search’;
- ‘Delete x’ means ‘delete the token x at the current position’.
A repair sequence is just that: an ordered sequence of repairs. For example, the first repair sequence above means that the input will be repaired to be equivalent to:
class C {
    void f() {
        { }
    }
}
while the second repair sequence will repair the input to be equivalent to:
class C {
    void f() {
        if (true) { }
    }
}
As this shows, CPCT+ is doing something very different to traditional error recovery: it’s repairing input spanning multiple tokens in one go. This is perfectly complementary to repairing syntax errors at different points in a file as this example shows:
[Animation]
Although CPCT+ can return multiple repair sequences, it would be impractical to keep all those possibilities running in parallel — I also doubt that users would be able to interpret the resulting errors! lrpar thus takes the first repair sequence returned by CPCT+, applies it to the input, and continues parsing.
At this point you might be rather sick of Java examples. Fortunately, there’s nothing Java specific about CPCT+. If I feed it Lua’s Lex and Yacc files and broken input it’ll happily repair that too [7]:
[Animation]
Indeed, CPCT+ will happily perform error recovery on any other language for which you can write a Yacc grammar.
Ultimately, CPCT+ has one main novelty relative to previous members of the Fischer et al. family: it presents the complete set of minimum cost repair sequences to the user, whereas other approaches non-deterministically present just one member of that set to the user. In other words, when, for our original Java example, CPCT+ presented this to users:
Parsing error at line 2 column 11. Repair sequences found:
1: Insert ,
2: Insert =
3: Delete y
approaches such as Corchuelo et al. would only have presented one repair sequence to the user. Since those approaches are non-deterministic, each time they’re run they can present a different repair sequence to the one before, which is rather confusing. The intuition behind “minimum cost repair sequence” is that we want to prioritise repair sequences which do the smallest number of alterations to the user’s input: insert and delete repairs increase a repair sequence’s cost, although shift repairs are cost-neutral.
To my mind, in the example above, both ‘Insert ,’ and ‘Insert =’ are equally likely to represent what the programmer intended, and it’s helpful to be shown both. ‘Delete y’ is a bit less likely to represent what the programmer intended, but it’s not a ridiculous suggestion, and in other similar contexts would be the more likely of the 3 repair sequences presented.
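To make the cost model above concrete, here’s a small Rust sketch (the Repair enum and the unit costs are illustrative stand-ins, not lrpar’s actual types) of how a repair sequence’s cost might be computed:

// Illustrative only: lrpar's real repair type is richer than this.
#[derive(Clone, Debug)]
enum Repair {
    Insert(&'static str), // insert a token at the current position
    Delete,               // delete the token at the current position
    Shift,                // keep the token and advance the search
}

// Assume unit costs: insert and delete repairs cost 1, shifts cost 0.
fn cost(repairs: &[Repair]) -> usize {
    repairs
        .iter()
        .map(|r| match r {
            Repair::Insert(_) | Repair::Delete => 1,
            Repair::Shift => 0,
        })
        .sum()
}

fn main() {
    // "Insert ), Shift {, Delete if, Delete true" from the earlier example.
    let seq = [
        Repair::Insert(")"),
        Repair::Shift,
        Repair::Delete,
        Repair::Delete,
    ];
    println!("{:?} has cost {}", seq, cost(&seq));
    assert_eq!(cost(&seq), 3);
}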
Under the bonnet
The paper and/or the code are the places to go if you want to know exactly how CPCT+ works, but I’ll try and give you a very rough idea of how it works here.
When lrpar encounters a syntax error, CPCT+ is started with the grammar’s statetable (the statemachine that an LR parser turns a grammar into; see e.g. this example), the current parsing stack (telling us where we are in the statetable, and how we got there), and the current position in the user’s input. By definition the top of the parsing stack will point to an error state in the statetable. CPCT+’s main job is to return a parsing stack and position to lrpar that allows lrpar to continue parsing; producing repair sequences is a happy by-product of that.
CPCT+ is thus a path-finding algorithm in disguise, and we model it as an instance of Dijkstra’s algorithm. In essence, each edge in the graph is a repair, which has a cost; we’re looking to find a path that leads us to success. In this case, “success” can occur in two ways: in rare cases where errors happen near the end of a file we might hit the statetable’s sole accept state; more commonly, we settle for shifting 3 tokens in a row (i.e. we’ve got to a point where we can parse some of the user’s input without causing further errors). As this might suggest, the core search is fairly simple.
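As a very rough sketch of that search (emphatically not lrpar’s real code: the Node type, the successors function, and the accept-state check are all stand-ins), the skeleton is a standard Dijkstra-style loop that stops as soon as a node has shifted 3 tokens in a row:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical, heavily simplified node: lrpar's real nodes carry a
// parsing stack, a position in the input and the repairs used so far.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Node {
    shifts_in_a_row: u8,
    repairs: Vec<String>,
}

// Stand-in for "apply each repair possible at this node and return the
// resulting nodes, each paired with the cost of the repair used".
fn successors(_n: &Node) -> Vec<(usize, Node)> {
    Vec::new()
}

// Dijkstra-style search: always expand the cheapest node first, so the
// first successful node found carries a minimum cost repair sequence.
fn search(start: Node) -> Option<Node> {
    let mut todo = BinaryHeap::new();
    todo.push((Reverse(0usize), start));
    while let Some((Reverse(cst), n)) = todo.pop() {
        // "Success" means we've shifted 3 tokens in a row (the rare
        // "reached the accept state" case is omitted in this sketch).
        if n.shifts_in_a_row >= 3 {
            return Some(n);
        }
        for (edge_cost, m) in successors(&n) {
            todo.push((Reverse(cst + edge_cost), m));
        }
    }
    None
}

fn main() {
    let start = Node { shifts_in_a_row: 0, repairs: Vec::new() };
    // With the stub successors there's nothing to explore, so no repair
    // sequence is found.
    assert!(search(start).is_none());
}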
Most of CPCT+’s complexity comes from the fact that we try to find all minimum cost paths to success and we need ways of optimising the search. There are a few techniques we describe in the paper to improve performance, so I’ll use what’s probably the most effective as an example. Our basic observation is that, when searching, once-distinct paths often end up reaching the same node, at which point we can model them as one henceforth. We therefore identify compatible nodes and merge them into one. The challenge is then how compatible nodes can be efficiently identified. We make use of an often forgotten facet of hashmaps: a node’s hash behaviour need only be a subset of its equality behaviour. In our case, nodes have three properties (parsing stack, remaining input, repair sequence): we hash based on two of these properties (parsing stack, remaining input) which quickly tells us if a node is potentially compatible; equality then checks (somewhat slowly) all three properties to confirm definite compatibility. This is a powerful optimisation, particularly on the hardest cases, improving average performance by 2x.
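Here’s a minimal sketch of that trick (the SearchNode fields are simplified stand-ins for lrpar’s real data structures): hash only the parsing stack and input position, but let equality compare all three fields.

use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Simplified stand-ins for the three properties a search node carries.
#[derive(PartialEq, Eq)]
struct SearchNode {
    parsing_stack: Vec<u32>, // states currently on the parsing stack
    input_pos: usize,        // position reached in the remaining input
    repairs: Vec<String>,    // the repair sequence that got us here
}

// Hash over a *subset* of the fields used for equality. This is legal
// (equal nodes still hash equally) and means nodes that differ only in
// their repair sequence land in the same hash bucket, where the slower
// full equality check over all three fields is then performed.
impl Hash for SearchNode {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.parsing_stack.hash(state);
        self.input_pos.hash(state);
    }
}

fn main() {
    let mut seen: HashSet<SearchNode> = HashSet::new();
    seen.insert(SearchNode {
        parsing_stack: vec![0, 3],
        input_pos: 7,
        repairs: vec!["Insert ,".to_string()],
    });
    // Hashes to the same bucket as the node above, but the full equality
    // check sees a different repair sequence.
    let n = SearchNode {
        parsing_stack: vec![0, 3],
        input_pos: 7,
        repairs: vec!["Insert =".to_string()],
    };
    assert!(!seen.contains(&n));
}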
Ranking repairs
CPCT+ collects the complete set of minimum cost repair sequences because I thought that would best match what a user would hope to see from an error recovery algorithm. The costs of creating the complete set of minimum cost repair sequences were clear early on but, to my surprise, there are additional benefits.
The overarching problem faced by all approaches in the Fischer et al. family is that the search space is unbounded in size. This is why shifting a mere 3 tokens from the user’s input is enough for us to declare a repair sequence successful: ideally we would like to check all of the remaining input, but that would lead to a combinatorial explosion on all but a handful of inputs. Put another way, CPCT’s core search is inherently local in nature: the repair sequences it creates can still cause subsequent spurious errors beyond the small part of the input that CPCT+ has searched.
The complete set of minimum cost repair sequences allows us to trivially turn the very-local search into a regional search, allowing us to rank repair sequences in a wider context. We take advantage of the fact that CPCT+’s core search typically only finds a small handful of repair sequences. We then temporarily apply each repair sequence to the input and see how far it can parse ahead without causing an error (up to a bound of 250 tokens). We then select the (non-strict) subset which has parsed furthest ahead and discard the rest. Consider this Java example:
class C {
    int x z() { }
}
When run through the “full” CPCT+ algorithm, two repair sequences are reported to the user:
Parsing error at line 2 column 11. Repair sequences found:
1: Delete z
2: Insert ;
However if I turn off ranking, we can see that the “core” of CPCT+ in fact generated three repair sequences:
Parsing error at line 2 column 11. Repair sequences found:
1: Insert =
2: Insert ;
3: Delete z
Parsing error at line 2 column 15. Repair sequences found:
1: Insert ;
In this particular run, lrpar chose to apply the ‘Insert =’ repair sequence to the input. That caused a spurious second error just beyond the region that CPCT+ had searched in. However, the other two repair sequences allow the whole file to be parsed successfully (there is a semantic problem with the resulting parse, but that’s another matter). It might not be immediately obvious, but traditional Fischer et al. algorithms wouldn’t be able to throw away the ‘Insert =’ repair sequence and keep looking for something better, because they have no point of comparison that would allow them to realise that it’s possible to do better. In other words, the unexpected advantage of the complete set of minimum cost repair sequences is precisely that it allows us to rank repair sequences relative to one another and discard the less good.
I’ve also realised over time that the ranking process (which requires about 20 lines of code) really kills two birds with one stone. First, and most obviously, it reduces spurious syntax errors. Second — and it took me a while to appreciate this — it reduces the quantity of low-quality repair sequences we present to users, making it more likely that users will actually check the repair sequences that are presented.
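In outline, the ranking step looks something like the following sketch (parse_ahead is a stub standing in for “temporarily apply the repair sequence and count how many of the next 250 tokens parse without a further error”; lrpar’s real code differs in its details):

// Stub standing in for "temporarily apply the repair sequence to the
// input and count how many of the following tokens (up to the 250
// token bound) parse without causing a further error".
fn parse_ahead(_repair_seq: &[String], _bound: usize) -> usize {
    0
}

// Keep only the (non-strict) subset of repair sequences which parsed
// furthest ahead; discard the rest.
fn rank(repair_seqs: Vec<Vec<String>>) -> Vec<Vec<String>> {
    let dists: Vec<usize> = repair_seqs
        .iter()
        .map(|rs| parse_ahead(rs, 250))
        .collect();
    let best = dists.iter().copied().max().unwrap_or(0);
    repair_seqs
        .into_iter()
        .zip(dists)
        .filter(|(_, d)| *d == best)
        .map(|(rs, _)| rs)
        .collect()
}

fn main() {
    let seqs = vec![
        vec!["Insert =".to_string()],
        vec!["Insert ;".to_string()],
        vec!["Delete z".to_string()],
    ];
    // With the stub parse_ahead all three tie, so all are kept; with a
    // real parser, 'Insert =' would parse less far ahead and be dropped.
    assert_eq!(rank(seqs).len(), 3);
}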
Lex errors
Like most parsing systems, grmtools separates out lexing (splitting input up into tokens) from parsing (determining if/how a sequence of tokens conforms to a grammar). This article isn’t the place to get into the pros and cons of this, but one factor is relevant: as soon as the lexer encounters an error in input, it will terminate. That means that parsing won’t start and, since error recovery as we’ve defined it thus far is part of the parser, the user won’t see error recovery. That leads to frustrating situations such as this:
[Animation]
Although it didn’t occur to me for a very long time, it’s trivial to convert lexing errors into parsing errors, and then have error recovery happen as normal. Even more surprisingly, this doesn’t require any support from lrpar or CPCT+. The user merely needs to catch input that otherwise wouldn’t lex by adding a rule such as the following at the end of their Lex file:
. "ERROR"
This matches a single character (the ‘.’) [8] as a new token type called ‘ERROR’. However, grmtools moans if tokens are defined in the lexer but not used in the Yacc grammar, so let’s shut it up by adding a dummy rule to the grammar:
ErrorRule: "ERROR" ;
Now when we run nimbleparse, CPCT+ kicks in on our “lexing error”:
Parsing error at line 2 column 11. Repair sequences found:
1: Delete #, Delete y
2: Insert ,, Delete #
3: Insert =, Delete #
I find this satisfying for two reasons: first, because users get a useful feature for precisely no additional effort on lrpar’s part; and second because lexing and parsing errors are now presented uniformly to the user, where before they were confusingly separate. It would probably be a good idea to make this a core feature, so that we could do things like merge consecutive error tokens, but that wouldn’t change the underlying technique.
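For example, a hypothetical post-lexing pass (this is not a grmtools feature, just a sketch of the merging idea) might collapse runs of consecutive ‘ERROR’ tokens before they reach the parser:

// Hypothetical post-lexing pass (not a grmtools feature): collapse runs
// of consecutive "ERROR" tokens into one, so a stretch of unlexable
// input is reported as a single error token.
fn merge_error_tokens(tokens: Vec<String>) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for t in tokens {
        if t == "ERROR" && out.last().map_or(false, |l| l == "ERROR") {
            continue;
        }
        out.push(t);
    }
    out
}

fn main() {
    let toks: Vec<String> = ["INT", "ERROR", "ERROR", "ID"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    assert_eq!(merge_error_tokens(toks), vec!["INT", "ERROR", "ID"]);
}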
Integrating error recovery into actions
As far as I have been able to tell, no “advanced” error recovery algorithm has ever made its way into a long-lasting parsing system: I couldn’t find a single implementation which I could run. Indeed, a surprising number of error recovery papers don’t even mention a corresponding implementation, though there must surely have been at least a research prototype at some point.
Whatever software did, or didn’t, exist, none of the papers I’ve read make any mention of how error recovery affects the use of parsers. Think about your favourite compiler: when it encounters a syntax error and recovers from it, it doesn’t just continue parsing, but also runs things like the type checker (though it probably refuses to generate code). Of course, the reason your favourite compiler is doing this is probably because it has a hand-written parser. How should we go about dealing with this in a Yacc-ish setting?
grmtools’ solution is surprisingly simple: action code (i.e. the code that is executed when a production is successfully parsed) isn’t given access to tokens directly, but instead to an enum which allows one to distinguish tokens inserted by error recovery from tokens in the user’s input. The reason for this is probably best seen via an example, in this case a very simple calculator grammar which calculates numbers as it parses:
[Animation]
In this case, what I chose to do when writing the calculator evaluator is to continue evaluating expressions with syntax errors in, unless an integer was inserted. The reason for that is that I simply don’t have a clue what value I should insert if CPCT+ generated an ‘Insert INT’ repair: is 0 reasonable? what about -1? or 1? As this suggests, inserting tokens can be quite problematic: while one might quibble about whether evaluation should continue when CPCT+ deleted the second ‘+’ in ‘2 + + 3’, at least that case doesn’t require the evaluator to pluck an integer value out of thin air.
This is an example of what I’ve ended up thinking of as the semantic consequences of error recovery: changing the syntax of the user’s input often changes its semantics, and there is no way for grmtools to know which changes have acceptable semantic consequences and which don’t. For example, inserting a missing ‘:’ in Python probably has no semantic consequences, but inserting the integer 999 into a calculator expression will have a significant semantic consequence.
The good news is that lrpar gives users the flexibility to deal with the semantic consequences of token insertion. For example here’s the grmtools-compatible grammar for the calculator language:
Expr -> Result<u64, Box<dyn Error>>:
      Expr '+' Term {
        Ok($1?.checked_add($3?)
          .ok_or_else(||
            Box::<dyn Error>::from("Overflow detected."))?)
      }
    | Term { $1 } ;

Term -> Result<u64, Box<dyn Error>>:
      Term '*' Factor {
        Ok($1?.checked_mul($3?)
          .ok_or_else(||
            Box::<dyn Error>::from("Overflow detected."))?)
      }
    | Factor { $1 } ;

Factor -> Result<u64, Box<dyn Error>>:
      '(' Expr ')' { $2 }
    | 'INT' {
        parse_int($lexer.span_str($1.map_err(|_|
          "<evaluation aborted>")?.span()))
      } ;
If you’re not used to Rust, that might look a little scary, so let’s start with some of the non-error recovery parts. First, the calculator grammar evaluates mathematical expressions as parsing occurs, and it deals exclusively in unsigned 64-bit integers. Second, unlike traditional Yacc, lrpar requires each rule to specify its return type. In this case, each rule has a return type Result<u64, Box<dyn Error>> which says “if successful this returns a u64; if unsuccessful it returns a pointer to a description of the error on the heap”. Put another way ‘dyn Error’ is Rust’s rough equivalent to “this thing will throw an exception”.
As with traditional Yacc, the ‘$n’ expressions in action code reference a production’s symbols, where n starts from 1. Symbols reference either rules or tokens. As with most parsing systems, symbols that reference rules have the static type of that rule (in this example Result<u64, Box<dyn Error>>).
Where grmtools diverges from existing systems is that tokens always have the static type Result<Lexeme, Lexeme>. If you’re used to Rust that might look a bit surprising, as Result types nearly always contain two distinct subtypes, but in this case we’re saying that in the “good” case you get a Lexeme and in the “bad” case you also get a Lexeme. The reason for this is that the “good” case (if you’re familiar with Rust terminology, the ‘Ok’ case) represents a token from the user’s actual input and the “bad” case (‘Err’) represents an inserted token. Because Result is a common Rust type, one can then use all of the standard idioms that Rust programmers are familiar with.
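For example, here’s a tiny standalone illustration (generic over the lexeme type, so it isn’t tied to lrpar’s actual Lexeme struct) of the two cases a token value can represent:

// Hypothetical standalone illustration: `tok` plays the role of a `$n`
// token value, which in grmtools has the type Result<Lexeme, Lexeme>.
// It's generic here so it isn't tied to lrpar's actual Lexeme struct.
fn describe<L>(tok: Result<L, L>) -> &'static str {
    match tok {
        Ok(_) => "a token from the user's actual input",
        Err(_) => "a token inserted by error recovery",
    }
}

fn main() {
    assert_eq!(describe::<()>(Err(())), "a token inserted by error recovery");
}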
Let’s first look at a simplified version of the first rule in the calculator grammar:
Expr -> Result<u64, Box<dyn Error>>:
Expr '+' Term { $1? + $3? }
| Term { $1 }
The Expr rule has two productions. The second of those (‘Term’) simply passes through the result of evaluating another rule unchanged. The first production tries adding the two integers produced by other rules together, but if either of those rules produced a dyn Error then the ‘?’ operator percolates that error upwards (roughly speaking: throws the exception up the call stack).
Now let’s look at a simplified (to the point of being slightly incorrect) version of the third rule in the grammar:
Factor -> Result<u64, Box<dyn Error>>:
      '(' Expr ')' { $2 }
    | 'INT' {
        parse_int($lexer.span_str($1?.span()))
      } ;
Compared to the full version above, the ‘INT’ action here uses ‘$1?’ directly rather than map_err, so an inserted ‘INT’ token aborts evaluation with a less helpful error than “<evaluation aborted>”.
lrpar also has features (e.g. the %avoid_insert
directive) that enable it to narrow down its search.
Programmers unintentionally leave hints in their input (e.g. indentation),
and languages have major structural components (e.g. block markers such as
curly brackets), that error recovery can take into account (see e.g. this approach). To take advantage of this, the grammar author would need to
provide hints such as “what are the start / end markers of a block”. Such hints
would be optional, but my guess is that most grammar authors would find the
resulting improvements sufficiently worthwhile that they’d be willing to invest
the necessary time to understand how to use them.
Finally, some parts of grammars necessarily allow huge numbers of
alternatives and error recovery at those points is hard work. The most obvious
example of this are binary or logical expressions, where many different
operators (e.g. ‘+’, ‘||’ etc.) are possible. This can
explode the search space, occasionally causing error recovery to fail, or
more often, for CPCT+ to generate an overwhelming number of repair sequences.
My favourite example of this – and this is directly taken from a real
example, albeit with modified variable names! – is the seemingly innocent,
though clearly syntactically incorrect, Java expression x = f(""a""b);. CPCT+ generates a comical
23,607 repair sequences for it. What is the user supposed to do with all that
information? I don’t know! One thing I experimented with at some point was
making the combinatorial aspect explicit so that instead of:
Insert x, Insert +, Insert y
Insert x, Insert +, Insert z
Insert x, Insert -, Insert y
Insert x, Insert -, Insert z
the user would be presented with:
Insert x, Insert {+, -}, Insert {y, z}
For various boring reasons, that feature was removed at some point, but writing
this down makes me think that it should probably be reintroduced. It wouldn’t
completely solve the “overwhelming number of repair sequences” problem, but it
would reduce it, probably substantially.
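For what it’s worth, here’s a rough sketch of one way the display merging might work (purely illustrative: it only handles repair sequences of equal length and over-approximates the set, so it’s a display aid rather than a faithful representation):

// Purely illustrative: only handles repair sequences of equal length and
// over-approximates (the displayed combinations may include sequences
// that weren't in the original set), so it's a display aid only.
fn merge_for_display(seqs: &[Vec<String>]) -> Option<Vec<Vec<String>>> {
    let len = seqs.first()?.len();
    if seqs.iter().any(|s| s.len() != len) {
        return None;
    }
    // For each position, collect the set of distinct repairs seen there.
    let mut merged = vec![Vec::new(); len];
    for s in seqs {
        for (i, r) in s.iter().enumerate() {
            if !merged[i].contains(r) {
                merged[i].push(r.clone());
            }
        }
    }
    Some(merged)
}

fn main() {
    let seqs: Vec<Vec<String>> = [
        ["Insert x", "Insert +", "Insert y"],
        ["Insert x", "Insert +", "Insert z"],
        ["Insert x", "Insert -", "Insert y"],
        ["Insert x", "Insert -", "Insert z"],
    ]
    .iter()
    .map(|s| s.iter().map(|r| r.to_string()).collect())
    .collect();
    // Prints the set of repairs seen at each position, e.g.
    // ["Insert +", "Insert -"] at position 1.
    for (i, rs) in merge_for_display(&seqs).unwrap().iter().enumerate() {
        println!("position {}: {:?}", i, rs);
    }
}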
Summary
Parsing is the sort of topic that brings conversations at parties to a
standstill. However, since every programmer relies upon parsing, the better a
job we can do of helping them fix errors quickly, the better we make their lives. CPCT+ will never
beat the very best hand-written error recovery algorithms, but what it does do
is bring pretty decent error recovery to any LR grammar. I hope and expect
that better error recovery algorithms will come along in the future, but
CPCT+ is here now, and if you use Rust, you might want to take a look at grmtools —
I’d suggest starting with the
quickstart guide in the grmtools book. Hopefully Yacc parsers for other languages might port
CPCT+, or something like it, to their implementations, because there isn’t
anything very Rust specific about CPCT+, and it’s a fairly easy algorithm to
implement (under 500 lines of code in its Rust incarnation).
Finally, one of the arguments that some people, quite reasonably, use
against LR parsing is that it has poor quality error recovery. That’s a shame
because, in my opinion, LR parsing is a beautiful approach to parsing. I hope that CPCT+’s error recovery
helps to lessen this particular obstacle to the use of LR parsing.
Acknowledgements: My thanks to Edd Barrett and Lukas Diekmann for comments.
Footnotes
[1] Often said to be the result of “fat fingers”. I have skinny fingers but, alas,
this does not seem to have translated to high typing accuracy.
[2] Typically using the FOLLOW set.
[3] When panic mode was introduced, computers were so slow that I expect even bad
error recovery was probably semi-useful: if a compiler takes
several minutes to run, each genuine error reported to the programmer is
valuable, and it’s worth risking them seeing a fairly high proportion of false errors. In
contrast, when compilers (mostly) take at most a few tens of seconds, the
acceptable proportion of false errors is much lower.
[4] There are languages for which one can fairly easily write a Yacc grammar but
not a Lex lexer (e.g. a Python-ish language with indentation-based syntax).
Systems like grmtools allow you to use your own (possibly hand-written) lexer
for precisely this reason. Fortunately, writing a good quality lexer by hand is
generally a fairly easy task.
[5] As a parsing research dilettante, I’ve long had a suspicion that proper parsing
researchers have to sign up to an elaborate, long running joke, whereby they
agree to leave out, or otherwise obscure, important details. Alas, most error
recovery papers are also part of this conspiracy, so I lost months to
misunderstandings of various papers. The most frustrating part is that I wonder
if I’ve unintentionally signed up to the conspiracy too: I have no idea whether
other people can make sense of the parsing papers I’ve been part of or not…
[6] Note that the order that the three repair sequences are presented in is
nondeterministic, so if you run it enough times you’ll see the repair sequences
printed in all possible orderings.
[7] If you’re wondering why I used the ‘-q’ option with nimbleparse,
it’s because nimbleparse prints debugging information for ambiguous grammars.
Lua’s grammar has an inherent ambiguity (with the Lua manual containing the
slightly scary statement that “The current parser always [resolves] such constructions in the first way”), which causes
nimbleparse to print out its full stategraph and the ambiguities. Even for a
small grammar such as Lua’s, the stategraph is not small. As this shows, Yacc
input grammars can be ambiguous, though they will be statically converted into
a fully unambiguous LR parser. Whether this is a good or bad feature is
something that I’ll leave to the philosophers amongst you to debate.
[8] Although ‘.’ is a safe default, there’s no reason why the error
token has to be defined that way. However, one has to be careful when using
this technique with Lex because its “longest match” rule means that the text
matched by the error token must never be longer than that matched by any other
token, otherwise large chunks of input end up being incorrectly classified as
an error token. In case you’re wondering, yes, I spent half an hour when
writing this blog post wondering why my attempts to be clever weren’t working
before remembering the longest match rule.
[9] The original Corchuelo et al. has a bug that means that it misses some
minimum cost repair sequences.
[10] My experience so far suggests that this is more often the case in less well
used parts of a grammar.