Irregular expressions
Regular expressions are fascinating to me. On one hand, they can be extremely succinct, expressive, and efficient. On the other hand, they can be basically write-only. They come with a simple but powerful theory that leads to efficient implementations. Sadly, many implementations ignore the theory in order to offer additional features, at the cost of worst-case exponential complexity.
It is possible, however, to implement some of those additional features, and still operate in worst-case linear time.
It was pointed out to me by Geoff Langdale and Evgeny Kapun that my implementation of those additional features is incorrect. I've hidden the wrong implementations for now.
The implementation (~400 lines of Rust) even fits in a single blog post!
The full code is on GitHub, and the commit history lines up with the blog post if you want to follow along.
State machines
The key to worst-case linear running time is to use Nondeterministic Finite Automata (NFAs). An NFA is basically a flow chart for deciding whether to accept or reject a string. The N in NFA means that multiple states can be active at the same time.
Before actually implementing NFAs, it's nice to start with an interface for them:
/// A regular expression matcher.
trait Matcher {
/// Activate this matcher's *start* state.
fn start(&mut self) -> bool;
/// Process a character from the string we're matching against.
fn push(&mut self, c: char) -> bool;
}
A Matcher
starts out with no active states.
Matcher::start()
activates the start state, and returns whether the matcher is currently in an accepting state (which could happen if, for example, the regex matches an empty string).
To test if a string matches, you feed its characters one-by-one to Matcher::push()
, or just call this helper method to do it for you:
trait Matcher {
...
/// Test if a string matches.
fn matches(&mut self, text: &str) -> bool {
text.chars().fold(self.start(), |_, c| self.push(c))
}
}
The simplest regex is the empty regex, so let's start with that.
/// A regular expression.
enum Regex {
/// The empty regex.
Empty,
}
The empty matcher has only one state.
/// Matcher for Regex::Empty.
#[derive(Default)]
struct Empty {
matched: bool,
}
impl Matcher for Empty {
fn start(&mut self) -> bool {
self.matched = true;
true
}
fn push(&mut self, _: char) -> bool {
self.matched = false;
false
}
}
It would be convenient if we could parse regexes instead of having to write things like Regex::Empty
explicitly.
We can use nom to build a parser, starting with a simple grammar.
/// Parser implementation.
fn regex(pattern: &str) -> IResult<&str, Regex> {
let empty = success(Regex::Empty);
let group = delimited(char('('), regex, char(')'));
alt((group, empty))
.parse(input)
}
The next-simplest regex is the .
metacharacter, which matches any character.
enum Regex {
...
/// Matches any one character.
Dot,
}
The matcher for .
has two states.
/// Matcher for Regex::Dot.
#[derive(Default)]
struct Dot {
started: bool,
matched: bool,
}
impl Matcher for Dot {
fn start(&mut self) -> bool {
self.started = true;
self.matched
}
fn push(&mut self, _: char) -> bool {
self.matched = self.started;
self.started = false;
self.matched
}
}
It's a small addition to the parser as well.
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let dot = char('.').map(|_| Regex::Dot);
alt((dot, group, empty))
.parse(input)
}
The next regex to support is character literals like c
that match only themselves.
enum Regex {
...
/// Matches a literal character.
Literal(char),
}
The matcher for literals is very similar to Dot
.
/// Matcher for Regex::Literal.
struct Literal {
c: char,
started: bool,
matched: bool,
}
impl Matcher for Literal {
fn start(&mut self) -> bool {
self.started = true;
self.matched
}
fn push(&mut self, c: char) -> bool {
self.matched = self.started && c == self.c;
self.started = false;
self.matched
}
}
The parser is extended with support for literals, as well as escape sequences like \.
to allow matching metacharacters literally.
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let meta = r"\().";
let literal = none_of(meta).map(Regex::Literal);
let escape = preceded(char('\\'), one_of(meta))
.map(Regex::Literal);
alt((literal, escape, dot, group, empty))
.parse(input)
}
So far our regexes can only match up to a single character.
The next step is to support the Kleene star operator *
, for regexes like a*
that match ""
, "a"
, "aa"
, "aaa"
, etc.
enum Regex {
...
/// Matches zero or more repetitions.
Star(Box<Regex>),
}
The NFA for the *
operator adds a loop with -transitions to let the pattern match multiple times.
It doesn't even need any states of its own.
/// Matcher for Regex::Star.
struct Star<A>(A);
impl<A: Matcher> Matcher for Star<A> {
fn start(&mut self) -> bool {
self.0.start() || true
}
fn push(&mut self, c: char) -> bool {
self.0.push(c) && self.start()
}
}
To parse the *
operator properly we have to shuffle some grammar rules around.
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let meta = r"\().*";
...
let atom = alt((literal, escape, dot, group));
let repeat = flat_map(atom, |r| fold_many0(
char('*'),
move || r.clone(),
|r, _| r.star(),
));
alt((repeat, empty))
.parse(input)
}
It's a bit sad that our regexes can currently only have one (non-meta) character.
We can handle a
, a*
, and even the questionably-useful a**
, but in order to support regexes like ab*c
, we need our first binary operator: concatenation.
enum Regex {
...
/// Matches two patterns in a row.
Concat(Box<Regex>, Box<Regex>),
}
The matcher for concatenated patterns connects the accept state of the first to the start state of the second.
/// Matcher for Regex::Concat.
struct Concat<L, R> {
left: L,
right: R,
right_started: bool,
}
impl<L: Matcher, R: Matcher> Matcher for Concat<L, R> {
fn start(&mut self) -> bool {
if self.left.start() {
self.right_started = true;
self.right.start();
} else {
false
}
}
fn push(&mut self, c: char) -> bool {
let mut ret = false;
if self.right_started {
ret |= self.right.push(c);
}
if self.left.push(c) {
self.right_started = true;
ret |= self.right.start();
}
ret
}
}
Just one more grammar rule is all we need to parse concatenations.
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let word = many1(repeat)
.map(|v| reduce(v, Regex::concat));
alt((word, empty))
.parse(input)
}
Our next feature is the alternation operator |
, which lets us match either (or both) of two patterns.
enum Regex {
...
/// Matches either of two patterns.
Or(Box<Regex>, Box<Regex>),
}
The matcher for alternations joins both patterns together with -transitions.
/// Matcher for Regex::Or.
struct Or<A, B>(A, B);
impl<A, B> Matcher for Or<A, B>
where
A: Matcher,
B: Matcher,
{
fn start(&mut self) -> bool {
self.0.start() | self.1.start()
}
fn push(&mut self, c: char) -> bool {
self.0.push(c) | self.1.push(c)
}
}
A couple more grammar rules are all we need.
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let meta = r"\().*|";
...
let chunk = alt((word, empty));
separated_list1(char('|'), chunk)
.map(|v| reduce(v, Regex::or))
.parse(pattern)
}
The features we implemented so far are enough to express every regular language—in other words, to match anything that could be matched by a regular expression.
Actually, we didn't even need .
, though it would be very inconvenient to have to write
(a|b|c|...|A|B|C|...|0|1|2|...|!|@|#|...)
instead.
There are still a few other features that regex engines commonly provide, such as the +
(repeat 1 or more times) and ?
(match 0 or 1 times) operators, that we could easily implement.
enum Regex {
...
/// Matches one or more repetitions.
Plus(Box<Regex>),
/// Matches zero or one times.
Maybe(Box<Regex>),
}
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let meta = r"\().*|+?";
...
let repeat = flat_map(atom, |r| fold_many0(
one_of("*+?"),
move || r.clone(),
|r, c| match c {
'*' => r.star(),
'+' => r.plus(),
'?' => r.maybe(),
_ => unreachable!(),
},
));
...
}
We don't even need any new matchers; we can just re-use existing ones.
impl Regex {
/// Compile a regular expression.
fn matcher(&self) -> Box<dyn Matcher> {
match self {
...
Self::Plus(a) => Box::new(
// A+ is the same as AA*
a.matcher().concat(a.matcher().star())
),
Self::Maybe(a) => Box::new(
// A? is the same as (A|)
a.matcher().or(Empty::default())
),
}
}
}
We're also missing character classes like [a-z]
, but this post is already long enough without them.
They could be handled like Regex::Literal
but with a list of ranges instead of a single char
.
Everything below this point is actually incorrect. Click to show it anyway.
Regular languages are closed under complement, which means that for every regex there is another that matches the exact opposite set of strings.
Unfortunately, this complementary regex can be pretty ugly.
Even a simple one like hello
has a pretty terrible complement:
|([^h]|h[^e]|he[^l]|hel[^l]|hell[^o]|hello.).*
Typically, users negate their matches on a layer above the regex syntax itself, like with grep -v
or if !regex.matches(text)
.
But sometimes it would be useful to have dedicated syntax for inverting a regex, especially just part of a regex.
We can invent a !
operator to express this.
enum Regex {
...
/// Matches the opposite of a pattern.
Not(Box<Regex>),
}
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let meta = r"\().*|+?!";
...
let not_word = many0_count(char('!'))
.and(word)
.map(|(n, r)| {
(0..n).fold(r, |r, _| r.not())
});
let chunk = alt((not_word, empty));
...
}
Now we can write the opposite of hello
much more simply: !hello
.
Finding the complement of an arbitrary NFA is expensive: you first convert the NFA to a DFA, which can blow up its size exponentially, and then flip the accepting/rejecting states. Lucky for us, we don't have to.
/// Matcher for Regex::Not.
struct Not<A>(A);
impl<A: Matcher> Matcher for Star<A> {
fn start(&mut self) -> bool {
!self.0.start()
}
fn push(&mut self, c: char) -> bool {
!self.0.push(c)
}
}
This is the key insight of the whole post: rather than actually finding the complement of an NFA, we can just write a Matcher
implementation that acts like it.
The !
operator lets us easily match things like "lines that don't contain hello", but it also opens the door to another more useful feature.
Sometimes we might have two patterns and want both of them to match.
For example, we might want to find lines that contain both hello
and world
.
Regular languages are also closed under intersection, so we should be able to write such a regex.
Again, though, it might be pretty long.
.*hello.*world.*|.*world.*hello.*
This quickly gets out of hand with more than two patterns.
But with the !
operator, we can use one of De Morgan's laws,
(A && B) == !(!A || !B)
to do the same thing without repeating ourselves:
!(!(.*hello.*)|!(.*world.*))
This is still not the easiest regex to read (strings that don't (not contain hello
) or (not contain world
)), so let's add an &
operator to express the concept more directly:
(.*hello.*)&(.*world.*)
enum Regex {
...
/// Matches the intersection of two patterns.
And(Box<Regex>, Box<Regex>),
}
fn regex(pattern: &str) -> IResult<&str, Regex> {
...
let meta = r"\().*|&";
...
let clause = separated_list1(char('&'), chunk)
.map(|v| reduce(v, Regex::and));
separated_list1(char('|'), clause)
.map(|v| reduce(v, Regex::or))
.parse(pattern)
}
We don't even need a new matcher,
impl Regex {
/// Compile a regular expression.
fn matcher(&self) -> Box<dyn Matcher> {
match self {
...
Self::And(a, b) => Box::new(
a.matcher().not()
.or(b.matcher().not())
.not()
),
}
}
}
but it's not hard to write one anyway.
/// Matcher for Regex::And.
struct And<A, B>(A, B);
impl<A: Matcher, B: Matcher> Matcher for And<A, B> {
fn start(&mut self) -> bool {
self.0.start() & self.1.start()
}
fn push(&mut self, c: char) -> bool {
self.0.push(c) & self.1.push(c)
}
}
Looking around
As far as I know, our !
and &
operators are not features of any major regex implementation.
But a similar feature called lookaround or positive/negative lookahead/lookbehind is found in many regex engines.
Feature | Syntax | Irregex equivalent |
---|---|---|
Positive lookahead | A(?=B)C | A((B.*)&C) |
Negative lookahead | A(?!B)C | A((!B.*)&C) |
Positive lookbehind | A(?<=B)C | (A&(.*B))C |
Negative lookbehind | A(?<!B)C | (A&(!.*B))C |
As far as I know, this feature is only provided by regex engines that have worst-case exponential complexity. The two major worst-case linear regex engines that I'm aware of, RE2 and Rust's regex crate, don't support it.
As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.
Notably, backreferences and arbitrary lookahead/lookbehind assertions are not provided. In return, regular expression searching provided by this package has excellent worst-case performance.
But as this post shows, it is certainly possible to support lookaround (not backreferences!) in linear time. There are some unanswered questions (what about capture groups?), but I'm curious if this implementation strategy would be workable in a production-quality regex engine.