PCRE.NET
Perl Compatible Regular Expressions for .NET
PCRE.NET is a .NET wrapper for the PCRE2 library.
The library provides variants for UTF-16 (.NET string and ReadOnlySpan<char>) and 8-bit encodings such as UTF-8 (ReadOnlySpan<byte>).
The following systems are supported:
- Windows x64
- Windows x86
- Linux x64
- Linux arm64
- macOS arm64
- macOS x64
API Types
The classic API
This is a friendly API that is very similar to .NET’s System.Text.RegularExpressions. It works on string objects, and supports the following operations:
- NFA matching and substring extraction:
  - Matches
  - Match
  - IsMatch
- Matched string replacement:
  - Using Replace, the PCRE.NET API:
    - Callbacks: Func<PcreMatch, string>
    - Replacement strings with placeholders: $n ${name} $& $_ $` $' $+
  - Using Substitute, the PCRE2 API:
    - Replacement strings with placeholders: $n ${n} $& $_ $` $' $+ $$ $*MARK ${*MARK}
    - Callouts for matches and substitutions
- String splitting on matches: Split
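As a rough illustration of the operations listed above, the sketch below runs IsMatch, Match, Matches, the callback form of Replace, and Split against one subject. The pattern, the subject and the variable names are made up for this example, and the name-based group indexer on the match object is assumed to be available; treat it as a sketch rather than a reference.

```csharp
using System;
using System.Linq;
using PCRE;

// Hypothetical pattern and subject, chosen only for illustration.
var regex = new PcreRegex(@"(?<key>\w+)=(?<value>\d+)");
const string subject = "a=1, b=22, c=333";

// IsMatch: is there at least one match?
bool hasMatch = regex.IsMatch(subject);

// Match: the first match only, with access to its named groups
// (the string indexer is an assumption of this sketch).
PcreMatch first = regex.Match(subject);
string firstKey = first["key"].Value;

// Matches: enumerate every match in the subject.
var keys = regex.Matches(subject).Select(m => m["key"].Value).ToList();

// Replace with a callback: double every numeric value.
string doubled = regex.Replace(
    subject,
    m => m["key"].Value + "=" + (int.Parse(m["value"].Value) * 2));

// Split: cut the subject on each separator match.
var parts = new PcreRegex(@",\s*").Split(subject).ToList();

Console.WriteLine(hasMatch);                 // True
Console.WriteLine(firstKey);                 // a
Console.WriteLine(string.Join("|", keys));   // a|b|c
Console.WriteLine(doubled);                  // a=2, b=44, c=666
Console.WriteLine(string.Join("|", parts));  // a=1|b=22|c=333
```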
The Span API
PcreRegex objects provide overloads which take a ReadOnlySpan<char> parameter for the following methods:
- Matches
- Match
- IsMatch
- Substitute
These methods return a ref struct type when possible, but are otherwise similar to the classic API.
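A minimal sketch of the span overloads, assuming the match returned for a span subject exposes Success and Value the same way the classic match object does. The subject text and variable names are hypothetical.

```csharp
using System;
using PCRE;

var regex = new PcreRegex(@"\d+");

// The subject is a span, so no intermediate string is needed for it.
ReadOnlySpan<char> subject = "order 1234, invoice 98".AsSpan();

// With a span subject, Match returns a ref struct rather than a class instance.
var match = regex.Match(subject);
string firstNumber = match.Success ? match.Value.ToString() : string.Empty;

// Matches over a span can be consumed with foreach.
int count = 0;
foreach (var m in regex.Matches(subject))
    count++;

Console.WriteLine(firstNumber); // 1234
Console.WriteLine(count);       // 2
```

Because the results are ref structs, they cannot be stored in fields or captured by lambdas, so copy out what you need (as firstNumber does here) before leaving the method.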
The zero-allocation API
This is the fastest matching API the library provides.
Call the CreateMatchBuffer method on a PcreRegex or PcreRegex8Bit/PcreRegexUtf8 instance to create the necessary data structures up-front, then use the returned match buffer for subsequent match operations. Performing a match through this buffer will not allocate further memory, which reduces GC pressure.
The downside of this approach is that the returned match buffer is not thread-safe and not reentrant: you cannot perform a match operation with a buffer which is already being used - match operations need to be sequential.
It is also counter-productive to allocate a match buffer to perform a single match operation. Use this API if you need to match a pattern against many subject strings.
PcreMatchBuffer objects are disposable (and finalizable in case they’re not disposed). They provide an API for matching against ReadOnlySpan<char> subjects. The same applies for PcreMatchBuffer8Bit objects on ReadOnlySpan<byte> subjects.
If you’re looking for maximum speed, consider using the following options:
- PcreOptions.Compiled at compile time to enable the JIT compiler, which will improve matching speed.
- PcreMatchOptions.NoUtfCheck at match time to skip the Unicode validity check: by default PCRE2 scans the entire input string to make sure it's valid Unicode.
- PcreOptions.MatchInvalidUtf at compile time if you plan to use PcreMatchOptions.NoUtfCheck and your subject strings may contain invalid Unicode sequences.
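The buffer workflow described above can be sketched as follows: compile once, create one buffer, reuse it sequentially for many subjects, then dispose it. The pattern and the input array are hypothetical, and only PcreOptions.Compiled is used here; the match-time options are left out to keep the sketch small.

```csharp
using System;
using PCRE;

// Compile once, with the JIT enabled.
var regex = new PcreRegex(@"^\d{4}-\d{2}-\d{2}$", PcreOptions.Compiled);

// Hypothetical input set.
string[] subjects = { "2024-01-31", "not a date", "1999-12-31", "2024-1-1" };

int dateCount = 0;

// Create the buffer once, use it for every subject, and dispose it afterwards.
// The buffer is not thread-safe, so the matches run one after another.
using (var buffer = regex.CreateMatchBuffer())
{
    foreach (var s in subjects)
    {
        if (buffer.Match(s.AsSpan()).Success)
            dateCount++;
    }
}

Console.WriteLine(dateCount); // 2
```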
The 8-bit and UTF-8 APIs
The PcreRegex8Bit class handles text provided as ReadOnlySpan<byte>. It requires an Encoding instance to interpret the byte sequences and turn them into .NET strings where needed, for example to expose named groups as .NET strings. When in doubt, with one-byte-per-character encodings, you can use ISO-8859-1 (Encoding.Latin1).
PcreRegexUtf8 is a specialization of PcreRegex8Bit which handles the input as UTF-8 encoded text. It is provided for convenience, as UTF-8 usage is very common. You could achieve the same result by using PcreRegex8Bit with Encoding.UTF8 and the PcreOptions.Utf flag.
A Span API similar to the one mentioned above is provided with the following methods:
- Matches
- Match
- IsMatch
There is also a zero-allocation API through the CreateMatchBuffer method.
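A small sketch of matching directly on UTF-8 bytes, assuming PcreRegexUtf8 can be constructed from a pattern string alone and that IsMatch accepts a ReadOnlySpan<byte>. The subjects are hypothetical; the euro sign takes three bytes in UTF-8, so the example only matches if the bytes are decoded as UTF-8 characters.

```csharp
using System;
using System.Text;
using PCRE;

// \p{Sc} matches a currency symbol.
var regex = new PcreRegexUtf8(@"\p{Sc}\s*\d+");

// Subjects are raw UTF-8 bytes, as they might come from a file or a socket.
ReadOnlySpan<byte> priced = Encoding.UTF8.GetBytes("total: € 42");
ReadOnlySpan<byte> unpriced = Encoding.UTF8.GetBytes("total: unknown");

bool pricedMatches = regex.IsMatch(priced);
bool unpricedMatches = regex.IsMatch(unpriced);

Console.WriteLine(pricedMatches);   // True
Console.WriteLine(unpricedMatches); // False
```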
The DFA matching API
This API provides regex matching in O(subject length) time. It is accessible through the Dfa property on a PcreRegex instance:
- Dfa.Matches
- Dfa.Match
You can read more about its features in the PCRE2 documentation, where it’s described as the alternative matching algorithm.
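As a hedged sketch of the DFA API: the alternative algorithm reports every match that starts at the first matching position, not just one. The Success and LongestMatch members used on the result are assumptions of this example, as are the pattern and subject.

```csharp
using System;
using PCRE;

var regex = new PcreRegex(@"\d+");

// At the first matching position, \d+ can match "2", "20", "202" and "2024".
// The DFA algorithm finds all of these alternatives in a single pass.
var result = regex.Dfa.Match("id 2024");

bool success = result.Success;

// LongestMatch is assumed to expose the longest of the alternatives found.
string longest = result.LongestMatch?.Value ?? string.Empty;

Console.WriteLine(success); // True
Console.WriteLine(longest); // 2024
```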
Library highlights
- Support for UTF-8 and UTF-16. Other 8-bit encodings are also supported.
- Support for compiled patterns (x86/x64/arm64 JIT)
- Support for partial matching (when the subject is too short to match the pattern)
- Callout support (numbered and string-based)
- Mark retrieval support
- Conversion from POSIX BRE, POSIX ERE and glob patterns (PcreConvert class)
Example usage
- Extract all words except those within parentheses:
var matches = PcreRegex.Matches("(foo) bar (baz) 42", @"\(\w+\)(*SKIP)(*FAIL)|\w+")
    .Select(m => m.Value)
    .ToList();
// result: "bar", "42"
- Enclose a series of punctuation characters within angle brackets using Replace (the PCRE.NET API):
var result = PcreRegex.Replace("hello, world!!!", @"\p{P}+", "<$&>");
// result: "hello<,> world<!!!>"
- Enclose a series of punctuation characters within angle brackets using Substitute (the PCRE2 API):
var result = PcreRegex.Substitute("hello, world!!!", @"\p{P}+", "<$0>", PcreOptions.None, PcreSubstituteOptions.SubstituteGlobal);
Assert.That(result, Is.EqualTo("hello<,> world<!!!>"));
- Partial matching:
var regex = new PcreRegex(@"(?<=abc)123");
var match = regex.Match("xyzabc12", PcreMatchOptions.PartialSoft);
// result: match.IsPartialMatch == true
- Validate a JSON string:
const string jsonPattern = """
    (?(DEFINE)
        # An object is an unordered set of name/value pairs.
        (?<object> \{
            (?: (?&keyvalue) (?: , (?&keyvalue) )* )?
        (?&ws) \} )
        (?<keyvalue>
            (?&ws) (?&string) (?&ws) : (?&value)
        )
        # An array is an ordered collection of values.
        (?<array> \[
            (?: (?&value) (?: , (?&value) )* )?
        (?&ws) \] )
        # A value can be a string in double quotes, or a number,
        # or true or false or null, or an object or an array.
        (?<value> (?&ws)
            (?: (?&string) | (?&number) | (?&object) | (?&array) | true | false | null )
        )
        # A string is a sequence of zero or more Unicode characters,
        # wrapped in double quotes, using backslash escapes.
        (?<string>
            " (?: [^"\\\p{Cc}]++ | \\u[0-9A-Fa-f]{4} | \\ ["\\/bfnrt] )* "
            # \p{Cc} matches control characters
        )
        # A number is very much like a C or Java number, except that the octal
        # and hexadecimal formats are not used.
        (?<number>
            -? (?: 0 | [1-9][0-9]* ) (?: \. [0-9]+ )? (?: [Ee] [-+]? [0-9]+ )?
        )
        # Whitespace
        (?<ws> \s*+ )
    )
    \A (?&ws) (?&object) (?&ws) \z
    """;
var regex = new PcreRegex(jsonPattern, PcreOptions.IgnorePatternWhitespace);
const string subject = """
    {
        "hello": "world",
        "numbers": [4, 8, 15, 16, 23, 42],
        "foo": null,
        "bar": -2.42e+17,
        "baz": true
    }
    """;
var isValidJson = regex.IsMatch(subject);
// result: true