Regex Utils
Zero-dependency TypeScript library for regex utilities that go beyond string matching. These are surprisingly hard to come by for any programming language. β¨
-
Online demos:
API Overview π
Regex Utils
Zero-dependency TypeScript library for regex utilities that go beyond string matching. These are surprisingly hard to come by for any programming language. β¨
-
Online demos:
API Overview π
-
π Set-like operations:
-
.and(...) - Compute intersection of two regex.
-
.not() - Compute the complement of a regex.
-
.without(...) - Compute the difference of two regex.
-
β Set-like predicates:
-
.isEquivalent(...) - Check whether two regex match the same strings.
-
π Generate strings:
-
.sample(...) - Generate random strings matching a regex.
-
.enumerate() - Exhaustively enumerate strings matching a regex.
-
π§ Miscellaneous:
-
.size() - Count the number of strings that a regex matches.
-
.derivative(...) - Compute a Brzozowski derivative of a regex.
Installation π¦
npm install @gruhn/regex-utils
import { RB } from '@gruhn/regex-utils'
Syntax Support
| Feature | Support | Examples |
|---|---|---|
| Quantifiers | β | a*, a+, a{3,10}, a? |
| Alternation | β | `a |
| Character classes | β | ., \w, [a-zA-Z], ... |
| Escaping | β | \$, \., ... |
| (Non-)capturing groups | β 1 | (?...), (...) |
| Start/end anchors | β οΈ 2 | ^, $ |
| Global flags | β | /.../g, /.../i, /.../m, ... |
| Local flags | β | (?i:...), ... |
| Unicode property escapes | β | \p{...}, \P{...} |
| Backreferences | β | \1 \2 ... |
| Lookahead | β | (?=...), (?!...) |
| Lookbehind | β | (?<=...), (?<!...) |
| Word boundary | β | \b, \B |
- Both capturing- and non-capturing groups are just treated as parenthesis, because this library is never doing string extraction.
- Some pathological patterns are not supported like anchors inside quantifiers
(^a)*.
An UnsupportedSyntaxError is thrown when unsupported patterns are detected. The library SHOULD ALWAYS either throw an error or respect the regex specification exactly. Please report a bug if the library silently uses a faulty interpretation.
Handling syntax-related errors:
import { RB, ParseError, UnsupportedSyntaxError } from '@gruhn/regex-utils'
try {
RB(/^[a-z]*$/)
} catch (error) {
if (error instanceof SyntaxError) {
// Invalid regex syntax! Native error, not emitted by this library.
// E.g. this will also throw a `SyntaxError`: new RegExp(')')
} else if (error instanceof ParseError) {
// The regex syntax is valid but the internal parser could not handle it.
// If this happens it's a bug in this library.
} else if (error instanceof UnsupportedSyntaxError) {
// Regex syntax is valid but not supported by this library.
}
}
Example Use Cases π‘
Generate random strings from Regex π
Generate 5 random email addresses:
const email = RB(/^[a-z]+@[a-z]+\.[a-z]{2,3}$/)
for (const str of email.sample().take(5)) {
console.log(str)
}
ky@e.no
cc@gg.gaj
z@if.ojk
vr@y.ehl
e@zx.hzq
Generate 5 random email addresses, which have exactly 20 characters:
const emailLength20 = email.and(/^.{20}$/)
for (const str of emailLength20.sample().take(5)) {
console.log(str)
}
kahragjijttzyze@i.mv
gnpbjzll@cwoktvw.hhd
knqmyotxxblh@yip.ccc
kopfpstjlnbq@lal.nmi
vrskllsvblqb@gemi.wc
Refactor Regex then Check Equivalence π
Say we identified a regex in the codebase that is prone to catastrophic backtracking and came up with a new version:
const oldRegex = /^(?:[a-zA-Z]\:\\|\\\\)([^\\\/\:\*\?\<\>\"\|]+(\\){0,1})+$/
const newRegex = /^(?:[a-zA-Z]:\\|\\\\)([^\\\/\:*?<>"|]+\\?)+$/
Using .isEquivalent we can verify that the refactored version matches exactly the same strings as the old version. That is, whether oldRegex.test(str) === newRegex.test(str) for every possible input string:
RB(oldRegex).isEquivalent(newRegex) // true
There is also a web interface for checking regex equivalence which also shows counterexample strings if the two regular expressions are not equivalent. The source code is a single HTML file: ./equiv-checker.html.
Comment Regex using Complement π¬
How do you write a regex that matches HTML comments like:
<!-- This is a comment -->
A straightforward attempt would be:
<!--.*-->
The problem is that .* also matches the end marker -->, so this is also a match:
<!-- This is a comment --> and this shouldn't be part of it -->
We need to specify that the inner part can be any string that does not contain -->. With .not() (aka. regex complement) this is easy:
import { RB } from '@gruhn/regex-utils'
const commentStart = RB('<!--')
const commentInner = RB(/^.*-->.*$/).not()
const commentEnd = RB('-->')
const comment = commentStart.concat(commentInner).concat(commentEnd)
With .toRegExp() we can convert back to a native JavaScript regex:
comment.toRegExp()
/^<!--(---*[^->]|-?[^-])*---*>$/
Password Regex using Intersections π
Itβs difficult to write a single regex for multiple independent constraints. For example, to specify a valid password. But with regex intersections itβs very natural:
import { RB } from '@gruhn/regex-utils'
const passwordRegex = RB(/^[a-zA-Z0-9]{12,32}$/) // 12-32 alphanumeric characters
.and(/[0-9]/) // at least one number
.and(/[A-Z]/) // at least one upper case letter
.and(/[a-z]/) // at least one lower case letter
We can convert this back to a native JavaScript RegExp with:
passwordRegex.toRegExp()
Note
The output RegExp can be very large.
We can also use other utilities like .size() to determine how many potential passwords match this regex:
console.log(passwordRegex.size())
2301586451429392354821768871006991487961066695735482449920n
With .sample() we can generate some of these matches:
for (const str of passwordRegex.sample().take(10)) {
console.log(str)
}
NEWJIAXQISWT0Wwm
lxoegadrzeynezkmtfcIBzzQ9e
ypzvhvtwpWk4u6
MSZXXKIKEKWKXLQ8HQ7Ds
BCBSFBSMNOLKlgQN5L
8950244600709IW1pg
UOTQBLVOTZQWFSAJYBXZNQBEeom0l
520302447164378435bv4dp4ysC
71073970686490eY2Jt4
afgpnxqwUK5B
Solve Advent Of Code 2023 - Day 12 π
In the coding puzzle Advent Of Code 2023 - Day 12 you are given pairs of string patterns. An example pair is .??..??...?##. and 1,1,3. Both patterns describe a class of strings and the task is to count the number of strings that match both patterns.
In the first pattern, . and # stand for the literal characters "dot" and "hash". The ? stands for either . or #. This can be written as a regular expression:
- for
#we simply write# - for
.we writeo(since.has a special meaning in regular expressions) - for
?we write(o|#)
So the pattern .??..??...?##. would be written as:
const firstRegex = /^o(o|#)(o|#)oo(o|#)(o|#)ooo(o|#)##o$/
In the second pattern, each digit stands for a sequence of # separated by at least one o. This can also be written as a regular expression:
- For a digit like
3we write#{3}. - Between digits we write
o+. - Additionally, arbitrary many
oare allowed at the start and end, so we addo*at the start and end.
Thus, 1,1,3 would be written as:
const secondRegex = /^o*#{1}o+#{1}o+#{3}o*$/
To solve the task and find the number of strings that match both regex, we can use .and(...) and .size() from regex-utils. .and(...) computes the intersection of two regular expressions. That is, it creates a new regex which exactly matches the strings matched by both input regex.
const intersection = RB(firstRegex).and(secondRegex)
With .size() we can then determine the number of matched strings:
console.log(intersection.size())
4n
While at it, we can also try .enumerate() to list all these matches:
for (const str of intersection.enumerate()) {
console.log(str)
}
oo#ooo#ooo###o
o#oooo#ooo###o
oo#oo#oooo###o
o#ooo#oooo###o
For a full solution checkout: ./benchmark/aoc2023-day12.ts.
References π
Heavily informed by these papers: