I invented a simple way to write lexers. I should probably write a formal paper about it but I have no academic institution nor business pushing me to, so I probably won’t. It is data-directed, so it has flow-of-control far simpler than most lexers. It relies heavily on bitmask operations instead of control logic.
The lexer uses a hash table. The table is keyed by an input character and optionally by the position in the token where the character is read. There is a ‘base’ entry for every character, automatically generated from the character properties and the lexical rules for simple nonfixed tokens like identifiers and numbers. There are additional entries for characters which can appear in fixed tokens, indexed by both the character and the positions in which it appears in fixed tokens. The data stored in the table is bitmasks.
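To pin down the shape of that table: the key is a (character, position) pair, the value is a bitmask, and a miss on the position-specific entry falls back to the character’s base entry. Here is a minimal C sketch of that lookup contract; the names, and the hash_find primitive it leans on, are placeholders for whatever hash table you actually use:
// SKETCH ONLY. ASSUMES SOME hash_find(character, position, &mask) PRIMITIVE YOU SUPPLY.
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t Bitmask; // one bit per token class
#define BASE (-1) // sentinel position for a character's base entry

bool hash_find(int character, int position, Bitmask *out); // assumed hash-table primitive

Bitmask table(int character, int position) {
    Bitmask mask;
    if (hash_find(character, position, &mask))
        return mask; // this character appears at this position in some fixed token
    hash_find(character, BASE, &mask); // the base entry exists for every character
    return mask;
}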
Each bitmask carries one bit per token class you want to recognize. Most of these are fixed tokens such as keywords and operators. Some are nonfixed tokens such as identifiers, numbers, and the text of inline constants. Depending on the complexity of the nonfixed tokens, some may need to be represented by more than one bit. For example, if you have a "no tabs in indentation" rule, you may have type-1 whitespace, which consists of a newline followed by some number of spaces, and type-2 whitespace, which may contain spaces and tabs but no newlines. The parser could then enforce the no-tabs-in-indentation rule by declaring an error if it finds any instance of type-1 immediately followed by type-2.
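To make the bit layout concrete, here is one way the classes from that example might be numbered in C (all names hypothetical). The nonfixed classes sit in the low bits; as the setup code further down shows, fixed tokens then claim the high bits, which is what gives them priority:
// SKETCH ONLY. NAMES ARE HYPOTHETICAL.
enum {
    TOK_IDENTIFIER = 0, // nonfixed token classes take the low bit positions
    TOK_NUMBER,
    TOK_WHITESPACE_1, // a newline followed by some number of spaces (indentation)
    TOK_WHITESPACE_2, // spaces and tabs, no newline
    NONFIXEDCOUNT // fixed tokens (keywords, operators) are numbered from here up
};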
Each fixed token corresponds to one bit in the bitmask. If the bit is on, then that keyword or syntax has not yet been ruled out as the value of the token being parsed.
So the lexer works thus:
// PSEUDOCODE ONLY. THIS IS NOT EXPECTED TO COMPILE, THIS HAS NOT BEEN TESTED, AND IT HAS NOT EVEN BEEN FORMALLY VERIFIED.
{
for (index = 0, workmask = oldmask = ALL_ONES; workmask != 0; index++) {
input = readInput();
oldmask = workmask;
workmask &= table(input, index);
}
unread(input); // push the excluding character back into the file cache, to be read as the first character next time.
do {
tokenidx = floor(log2(oldmask)); // get the index of the highest bit set (floor, not round: round(log2(3)) would give 2)
if ((tokenidx < nonfixedcount) || // if it's the bit of a nonfixed token, return it;
(tokenlengths[tokenidx] == index - 1)) // if it's the bit of a fixed token, its length must match the characters matched
return(tokenidx); // (index - 1, because the last character read was the excluder).
oldmask &= ~(0x1 << tokenidx); // clear the bit.
} while (oldmask != 0); // until we have returned, or all remaining set bits have been checked.
print("This never happens."); // if we get here it's a bug, so report an error.
}
At the start of a new token, the lexer sets its working bitmask to all ones. Then it iterates through the input. Each time it reads a character C at position P, it fetches table(C,P) from the hash table and ANDs it with its working bitmask, saving the previous value first.
So for example, we can set our working bitmask to ones, then process ‘w’, ‘h’, ‘i’, ‘l’, ‘e’, ’ ’, by masking it with table(‘w’,0), table(‘h’,1), table(‘i’,2), table(‘l’,3), and table(‘e’,4). At this point the bit for the ‘while’ keyword is the only fixed-token bit still set. When we mask with table(’ ’,5) it goes dark. We check that the length of the token corresponding to that bit matches the number of characters we’ve matched, five, and output ‘while’.
(Note: in this case the ‘while’ bit and the ‘identifier’ bit will both be set. After masking with table(’ ’,5) both will be unset, so both are candidates under the ‘last nonzero mask’ rule. But given a conflict between keyword and identifier, we always pick the keyword: fixed tokens occupy the higher bit positions, and we scan from the highest set bit down.)
The length check is there to handle things like ‘-’ and ‘--’ and ‘-=’ all being syntax. After masking with table(‘-’,0), all three bits are still set. A letter at position 1 would then clear all three at the same time. But only one of those three finalists has a length equal to the number of characters matched.
There is a ‘base’ bitmask for each character. These simply encode a character’s relationship to the nonfixed categories: whether it’s a digit, a hex digit, a binary digit, a possible identifier character, and so on. They are auto-generated from the character’s properties and ignore the fixed tokens completely; the fixed-token bits in a base entry are simply left clear.
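A sketch of that auto-generation for ASCII, using the hypothetical class names from earlier. The fixed-token bits are left at zero here; the setup loop below turns them on only for the exact character-and-position pairs where some fixed token needs them:
// SKETCH ONLY. POSITION-SENSITIVE RULES (E.G. NO LEADING DIGIT IN AN IDENTIFIER)
// WOULD GET THEIR OWN POSITION-KEYED ENTRIES; THIS IGNORES THAT WRINKLE.
#include <ctype.h>

Bitmask base_mask(int c) {
    Bitmask m = 0;
    if (isalpha(c) || c == '_' || isdigit(c))
        m |= (Bitmask)1 << TOK_IDENTIFIER;
    if (isdigit(c))
        m |= (Bitmask)1 << TOK_NUMBER;
    if (c == ' ' || c == '\t')
        m |= (Bitmask)1 << TOK_WHITESPACE_2;
    if (c == ' ' || c == '\n')
        m |= (Bitmask)1 << TOK_WHITESPACE_1;
    return m;
}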
What I like about this method is the absolute simplicity of setting it up. You simply iterate through your list of fixed tokens, doing very obvious things with each:
// PSEUDOCODE ONLY. THIS IS NOT EXPECTED TO COMPILE, THIS HAS NOT BEEN TESTED, AND IT HAS NOT EVEN BEEN FORMALLY VERIFIED.
for (i = 0; i < fixtokencount; i++) { // fixed tokens have highest priority, so they get the highest bit positions;
bitplace = nonfixedcount + i; // nonfixed tokens like identifiers and numbers keep the low bit positions
bitval = 0x1 << bitplace;
token = fixtokenlist[i];
for (place = 0; place < length(token); place++) {
letter = token[place];
Tmask = table(letter, place) if it exists, a copy of table(letter, base) if it doesn't; // get the existing mask, or copy the letter's base mask
Tmask |= bitval; // poke a hole at bitplace: this letter, at this position, keeps the token alive
table(letter, place) = Tmask; // store the changed mask back in the table
}
tokenlengths[bitplace] = length(token); // store the length of the token in the lexer's length table
}
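And to convince yourself the whole pipeline hangs together, here is a complete toy version in C: ASCII only, flat arrays standing in for the hash table, three fixed tokens (‘-’, ‘--’, ‘-=’) plus an identifier class. It is a sketch of the idea rather than production code, and every name in it is a placeholder:
// SKETCH ONLY. A TOY, NOT A PRODUCTION LEXER.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <ctype.h>

typedef uint32_t Bitmask;
enum { TOK_IDENTIFIER = 0, NONFIXEDCOUNT }; // nonfixed classes: low bits
static const char *fixtokenlist[] = { "-", "--", "-=" }; // fixed tokens: high bits
enum { FIXTOKENCOUNT = 3, MAXLEN = 2 }; // longest fixed token is 2 characters
#define ALL_ONES (((Bitmask)1 << (NONFIXEDCOUNT + FIXTOKENCOUNT)) - 1)

static Bitmask tbl[MAXLEN][128]; // [position][character]: stands in for the hash table
static Bitmask base[128]; // base entries, from character properties alone
static int tokenlengths[NONFIXEDCOUNT + FIXTOKENCOUNT];

static Bitmask table(int c, int pos) { // fall back to the base entry
    return (pos < MAXLEN) ? tbl[pos][c] : base[c];
}

static void build(void) {
    for (int c = 0; c < 128; c++) {
        if (isalpha(c) || c == '_')
            base[c] |= (Bitmask)1 << TOK_IDENTIFIER;
        for (int p = 0; p < MAXLEN; p++)
            tbl[p][c] = base[c]; // position entries start as copies of base
    }
    for (int i = 0; i < FIXTOKENCOUNT; i++) { // poke the holes for each fixed token
        int bitplace = NONFIXEDCOUNT + i;
        tokenlengths[bitplace] = (int)strlen(fixtokenlist[i]);
        for (int p = 0; fixtokenlist[i][p]; p++)
            tbl[p][(int)fixtokenlist[i][p]] |= (Bitmask)1 << bitplace;
    }
}

static int highest_bit(Bitmask m) { // stand-in for the log2 / count-leading-zeros trick
    int i = -1;
    while (m) { m >>= 1; i++; }
    return i;
}

static int lex(const char **s) { // returns a bit index and advances *s
    const char *p = *s;
    Bitmask workmask = ALL_ONES, oldmask = workmask;
    int index;
    for (index = 0; workmask != 0; index++) {
        oldmask = workmask;
        workmask &= table((unsigned char)p[index], index);
    }
    int matched = index - 1; // the last character read was the excluder
    *s = p + matched; // "unread" it by not consuming it
    while (oldmask != 0) {
        int tokenidx = highest_bit(oldmask);
        if (tokenidx < NONFIXEDCOUNT || tokenlengths[tokenidx] == matched)
            return tokenidx;
        oldmask &= ~((Bitmask)1 << tokenidx); // finalist rejected: clear the bit, try the next
    }
    return -1; // "this never happens"
}

int main(void) {
    static const char *names[] = { "IDENT", "'-'", "'--'", "'-='" };
    build();
    const char *s = "ab--c-=d-";
    while (*s != '\0') {
        const char *start = s;
        int tok = lex(&s);
        printf("%-5s \"%.*s\"\n", names[tok], (int)(s - start), start);
    }
    return 0;
}
Running it prints six tokens in order, with ‘--’ and ‘-=’ beating ‘-’ only when their lengths actually match the characters consumed.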
And you are done with fixed tokens. The lexer will recognize all of them, won’t hallucinate tokens that don’t exist, will run reasonably efficiently, and most important, the source will explain very clearly exactly how and why.
There is nothing "mysterious" or "difficult" or even "complicated" going on here. Deterministic finite automata, Chomsky grammars, and state machines are lovely, but the way this works is so brutally simple it should be clear to any student on the first day it’s explained.
Benefits:
Stunning simplicity of implementation. Better scaling than state-machine or flow-of-control approaches. Takes advantage of bit operations to parallelize comparisons that would otherwise require sequential or carefully state-dependent implementations. Likely to parallelize further more easily than existing approaches.
Costs:
One hash table lookup per character read is definitely the most expensive thing I’m doing. Depending on the size of the character set in use and the size of the bitmask, the table can overflow the L2 cache even on high-end desktop machines. A complex language with up to 128 token classes (a 16-byte bitmask) and all ~169K Unicode codepoints (excluding private use and surrogates) would make the raw table about 2.6 megabytes: 169,000 entries at 16 bytes each, plus overhead. Overhead is brutal for hash tables, so in practice that’s no smaller than 4.5 megabytes, which means a lot of L2 cache misses. An implementation excluding Chinese characters and using an 8-byte bitmask, on the other hand, could keep the table under 0.5 megabytes plus overhead, which fits handily into most L2 caches.
I’m using a logarithm operation to get the highest bit set. On reflex this seemed wrong, because "transcendental functions are SLOOOWW!" However, on reflection I realized I had learned that a very long time ago and it’s no longer true: on a modern CPU the floating-point log2 costs barely more than a multiplication.
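Better still, on GCC and Clang the highest set bit comes straight out of a builtin that compiles to a single instruction (BSR/LZCNT on x86-64, CLZ on ARM), so the floating-point unit never gets involved. A sketch:
// SKETCH ONLY. GCC/CLANG BUILTIN; USE THE EQUIVALENT INTRINSIC ELSEWHERE.
#include <stdint.h>

static inline int highest_bit64(uint64_t m) {
    return 63 - __builtin_clzll(m); // undefined for m == 0, so guard the call
}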