Irregular expressions

2023-03-07 Tavian Barnes GitHub Hacker News

Regular expressions are fascinating to me. On one hand, they can be extremely succinct, expressive, and efficient. On the other hand, they can be basically write-only. They come with a simple but powerful theory that leads to efficient implementations. Sadly, many implementations ignore the theory in order to offer additional features, at the cost of worst-case exponential complexity.

~~It is possible, however, to implement some of those additional features, and still operate in worst-case linear time.~~ It was pointed out to me by Geoff Langdale and Evgeny Kapun that my implementation of those additional features is incorrect. I've hidden the wrong implementations for now. The implementation (~400 lines of Rust) even fits in a single blog post! The full code is on GitHub, and the commit history lines up with the blog post if you want to follow along.

State machines

The key to worst-case linear running time is to use Nondeterministic Finite Automata (NFAs). An NFA is basically a flow chart for deciding whether to accept or reject a string. The N in NFA means that multiple states can be active at the same time.

Before actually implementing NFAs, it's nice to start with an interface for them:

/// A regular expression matcher.
trait Matcher {
    /// Activate this matcher's *start* state.
    fn start(&mut self) -> bool;

    /// Process a character from the string we're matching against.
    fn push(&mut self, c: char) -> bool;
}

A Matcher starts out with no active states. Matcher::start() activates the start state, and returns whether the matcher is currently in an accepting state (which could happen if, for example, the regex matches an empty string). To test if a string matches, you feed its characters one-by-one to Matcher::push(), or just call this helper method to do it for you:

trait Matcher {
    ...
    /// Test if a string matches.
    fn matches(&mut self, text: &str) -> bool {
        text.chars().fold(self.start(), |_, c| self.push(c))
    }
}

Much ado about nothing

Commit

The simplest regex is the empty regex, so let's start with that.

/// A regular expression.
enum Regex {
    /// The empty regex.
    Empty,
}

The empty matcher has only one state.

/// Matcher for Regex::Empty.
#[derive(Default)]
struct Empty {
    matched: bool,
}

impl Matcher for Empty {
    fn start(&mut self) -> bool {
        self.matched = true;
        true
    }

    fn push(&mut self, _: char) -> bool {
        self.matched = false;
        false
    }
}

It would be convenient if we could parse regexes instead of having to write things like Regex::Empty explicitly. We can use nom to build a parser, starting with a simple grammar.

$\begin{aligned} \text{Eᴍᴘᴛʏ} & \rarr \varepsilon \\[2pt] \text{Gʀᴏᴜᴘ} & \rarr \texttt{(} \; \text{Rᴇɢᴇx} \; \texttt{)} \\[2pt] \text{Rᴇɢᴇx} & \rarr \text{Gʀᴏᴜᴘ} \mid \text{Eᴍᴘᴛʏ} \\ \end{aligned}$

/// Parser implementation.
fn regex(pattern: &str) -> IResult<&str, Regex> {
    let empty = success(Regex::Empty);

    let group = delimited(char('('), regex, char(')'));

    alt((group, empty))
        .parse(input)
}

Dot dot dot

Commit

The next-simplest regex is the . metacharacter, which matches any character.

enum Regex {
    ...
    /// Matches any one character.
    Dot,
}

The matcher for . has two states.

/// Matcher for Regex::Dot.
#[derive(Default)]
struct Dot {
    started: bool,
    matched: bool,
}

impl Matcher for Dot {
    fn start(&mut self) -> bool {
        self.started = true;
        self.matched
    }

    fn push(&mut self, _: char) -> bool {
        self.matched = self.started;
        self.started = false;
        self.matched
    }
}

It's a small addition to the parser as well.

$\begin{aligned} \text{Dᴏᴛ} & \rarr \texttt{.} \\[2pt] \text{Rᴇɢᴇx} & \rarr \text{Dᴏᴛ} \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let dot = char('.').map(|_| Regex::Dot);

    alt((dot, group, empty))
        .parse(input)
}

Literally me

Commit

The next regex to support is character literals like c that match only themselves.

enum Regex {
    ...
    /// Matches a literal character.
    Literal(char),
}

The matcher for literals is very similar to Dot.

/// Matcher for Regex::Literal.
struct Literal {
    c: char,
    started: bool,
    matched: bool,
}

impl Matcher for Literal {
    fn start(&mut self) -> bool {
        self.started = true;
        self.matched
    }

    fn push(&mut self, c: char) -> bool {
        self.matched = self.started && c == self.c;
        self.started = false;
        self.matched
    }
}

The parser is extended with support for literals, as well as escape sequences like \. to allow matching metacharacters literally.

$\begin{aligned} \text{Mᴇᴛᴀ} & \rarr \texttt{\textbackslash} \mid \texttt{(} \mid \texttt{)} \mid \texttt{.} \\[2pt] \text{Lɪᴛᴇʀᴀʟ} & \rarr \text{any non-Mᴇᴛᴀ} \\ & \hphantom{{}\rarr{}} \text{character} \\[2pt] \text{Esᴄᴀᴘᴇ} & \rarr \texttt{\textbackslash} \; \text{Mᴇᴛᴀ} \\[2pt] \text{Rᴇɢᴇx} & \rarr \text{Lɪᴛᴇʀᴀʟ} \mid \text{Esᴄᴀᴘᴇ} \\ \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let meta = r"\().";

    let literal = none_of(meta).map(Regex::Literal);

    let escape = preceded(char('\\'), one_of(meta))
        .map(Regex::Literal);

    alt((literal, escape, dot, group, empty))
        .parse(input)
}

A star is born

Commit

So far our regexes can only match up to a single character. The next step is to support the Kleene star operator *, for regexes like a* that match "", "a", "aa", "aaa", etc.

enum Regex {
    ...
    /// Matches zero or more repetitions.
    Star(Box<Regex>),
}

The NFA for the * operator adds a loop with $\varepsilon$ -transitions to let the pattern match multiple times. It doesn't even need any states of its own.

/// Matcher for Regex::Star.
struct Star<A>(A);

impl<A: Matcher> Matcher for Star<A> {
    fn start(&mut self) -> bool {
        self.0.start() || true
    }

    fn push(&mut self, c: char) -> bool {
        self.0.push(c) && self.start()
    }
}

To parse the * operator properly we have to shuffle some grammar rules around.

$\begin{aligned} \text{Mᴇᴛᴀ} & \rarr \texttt{*} \\[2pt] \text{Aᴛᴏᴍ} & \rarr \text{Lɪᴛᴇʀᴀʟ} \mid \text{Esᴄᴀᴘᴇ} \\ & \rarr \text{Dᴏᴛ} \\ & \rarr \text{Gʀᴏᴜᴘ} \\[2pt] \text{Rᴇᴘᴇᴀᴛ} & \rarr \text{Aᴛᴏᴍ} \\ & \rarr \text{Rᴇᴘᴇᴀᴛ} \; \texttt{*} \\[2pt] \text{Rᴇɢᴇx} & \rarr \text{Rᴇᴘᴇᴀᴛ} \mid \text{Eᴍᴘᴛʏ} \\ \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let meta = r"\().*";
    ...
    let atom = alt((literal, escape, dot, group));

    let repeat = flat_map(atom, |r| fold_many0(
        char('*'),
        move || r.clone(),
        |r, _| r.star(),
    ));

    alt((repeat, empty))
        .parse(input)
}

One is the loneliest number

Commit

It's a bit sad that our regexes can currently only have one (non-meta) character. We can handle a, a*, and even the questionably-useful a**, but in order to support regexes like ab*c, we need our first binary operator: concatenation.

enum Regex {
    ...
    /// Matches two patterns in a row.
    Concat(Box<Regex>, Box<Regex>),
}

The matcher for concatenated patterns connects the accept state of the first to the start state of the second.

/// Matcher for Regex::Concat.
struct Concat<L, R> {
    left: L,
    right: R,
    right_started: bool,
}

impl<L: Matcher, R: Matcher> Matcher for Concat<L, R> {
    fn start(&mut self) -> bool {
        if self.left.start() {
            self.right_started = true;
            self.right.start();
        } else {
            false
        }
    }

    fn push(&mut self, c: char) -> bool {
        let mut ret = false;

        if self.right_started {
            ret |= self.right.push(c);
        }

        if self.left.push(c) {
            self.right_started = true;
            ret |= self.right.start();
        }

        ret
    }
}

Just one more grammar rule is all we need to parse concatenations.

$\begin{aligned} \text{Wᴏʀᴅ} & \rarr \text{Rᴇᴘᴇᴀᴛ} \\ & \rarr \text{Wᴏʀᴅ} \; \text{Rᴇᴘᴇᴀᴛ} \\[2pt] \text{Rᴇɢᴇx} & \rarr \text{Wᴏʀᴅ} \mid \text{Eᴍᴘᴛʏ} \\ \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let word = many1(repeat)
        .map(|v| reduce(v, Regex::concat));

    alt((word, empty))
        .parse(input)
}

This or that

Commit

Our next feature is the alternation operator |, which lets us match either (or both) of two patterns.

enum Regex {
    ...
    /// Matches either of two patterns.
    Or(Box<Regex>, Box<Regex>),
}

The matcher for alternations joins both patterns together with $\varepsilon$ -transitions.

/// Matcher for Regex::Or.
struct Or<A, B>(A, B);

impl<A, B> Matcher for Or<A, B>
where
    A: Matcher,
    B: Matcher,
{
    fn start(&mut self) -> bool {
        self.0.start() | self.1.start()
    }

    fn push(&mut self, c: char) -> bool {
        self.0.push(c) | self.1.push(c)
    }
}

A couple more grammar rules are all we need.

$\begin{aligned} \text{Mᴇᴛᴀ} & \rarr \texttt{|} \\[2pt] \text{Cʜᴜɴᴋ} & \rarr \text{Wᴏʀᴅ} \mid \text{Eᴍᴘᴛʏ} \\[2pt] \text{Rᴇɢᴇx} & \rarr \text{Cʜᴜɴᴋ} \\ & \rarr \text{Rᴇɢᴇx} \; \texttt{|} \; \text{Cʜᴜɴᴋ} \\ \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let meta = r"\().*|";
    ...
    let chunk = alt((word, empty));

    separated_list1(char('|'), chunk)
        .map(|v| reduce(v, Regex::or))
        .parse(pattern)
}

Feature-complete

Commit, commit

The features we implemented so far are enough to express every regular language—in other words, to match anything that could be matched by a regular expression. Actually, we didn't even need ., though it would be very inconvenient to have to write

(a|b|c|...|A|B|C|...|0|1|2|...|!|@|#|...)

instead.

There are still a few other features that regex engines commonly provide, such as the + (repeat 1 or more times) and ? (match 0 or 1 times) operators, that we could easily implement.

enum Regex {
    ...
    /// Matches one or more repetitions.
    Plus(Box<Regex>),
    /// Matches zero or one times.
    Maybe(Box<Regex>),
}

$\begin{aligned} \text{Mᴇᴛᴀ} & \rarr \texttt{+} \mid \texttt{?} \\[2pt] \text{Rᴇᴘᴇᴀᴛ} & \rarr \text{Rᴇᴘᴇᴀᴛ} \; \texttt{+} \\ & \rarr \text{Rᴇᴘᴇᴀᴛ} \; \texttt{?} \\ \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let meta = r"\().*|+?";
    ...
    let repeat = flat_map(atom, |r| fold_many0(
        one_of("*+?"),
        move || r.clone(),
        |r, c| match c {
            '*' => r.star(),
            '+' => r.plus(),
            '?' => r.maybe(),
            _ => unreachable!(),
        },
    ));
    ...
}

We don't even need any new matchers; we can just re-use existing ones.

impl Regex {
    /// Compile a regular expression.
    fn matcher(&self) -> Box<dyn Matcher> {
        match self {
            ...
            Self::Plus(a) => Box::new(
                // A+ is the same as AA*
                a.matcher().concat(a.matcher().star())
            ),
            Self::Maybe(a) => Box::new(
                // A? is the same as (A|)
                a.matcher().or(Empty::default())
            ),
        }
    }
}

We're also missing character classes like [a-z], but this post is already long enough without them. They could be handled like Regex::Literal but with a list of ranges instead of a single char.

Everything below this point is actually incorrect. Click to show it anyway.

Not doing the work

Commit

Regular languages are closed under complement, which means that for every regex there is another that matches the exact opposite set of strings. Unfortunately, this complementary regex can be pretty ugly. Even a simple one like hello has a pretty terrible complement:

|([^h]|h[^e]|he[^l]|hel[^l]|hell[^o]|hello.).*

Typically, users negate their matches on a layer above the regex syntax itself, like with grep -v or if !regex.matches(text). But sometimes it would be useful to have dedicated syntax for inverting a regex, especially just part of a regex. We can invent a ! operator to express this.

enum Regex {
    ...
    /// Matches the opposite of a pattern.
    Not(Box<Regex>),
}

$\begin{aligned} \text{Mᴇᴛᴀ} & \rarr \texttt{!} \\[2pt] \text{NᴏᴛWᴏʀᴅ} & \rarr \text{Wᴏʀᴅ} \\ & \rarr \texttt{!} \; \text{NᴏᴛWᴏʀᴅ} \\[2pt] \text{Cʜᴜɴᴋ} & \rarr \text{NᴏᴛWᴏʀᴅ} \mid \text{Eᴍᴘᴛʏ} \\ \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let meta = r"\().*|+?!";
    ...
    let not_word = many0_count(char('!'))
        .and(word)
        .map(|(n, r)| {
            (0..n).fold(r, |r, _| r.not())
        });

    let chunk = alt((not_word, empty));
    ...
}

Now we can write the opposite of hello much more simply: !hello.

Finding the complement of an arbitrary NFA is expensive: you first convert the NFA to a DFA, which can blow up its size exponentially, and then flip the accepting/rejecting states. Lucky for us, we don't have to.

/// Matcher for Regex::Not.
struct Not<A>(A);

impl<A: Matcher> Matcher for Star<A> {
    fn start(&mut self) -> bool {
        !self.0.start()
    }

    fn push(&mut self, c: char) -> bool {
        !self.0.push(c)
    }
}

This is the key insight of the whole post: rather than actually finding the complement of an NFA, we can just write a Matcher implementation that acts like it. The ! operator lets us easily match things like "lines that don't contain hello", but it also opens the door to another more useful feature.

Yes, and

Commit

Sometimes we might have two patterns and want both of them to match. For example, we might want to find lines that contain both hello and world. Regular languages are also closed under intersection, so we should be able to write such a regex. Again, though, it might be pretty long.

.*hello.*world.*|.*world.*hello.*

This quickly gets out of hand with more than two patterns. But with the ! operator, we can use one of De Morgan's laws,

(A && B) == !(!A || !B)

to do the same thing without repeating ourselves:

!(!(.*hello.*)|!(.*world.*))

This is still not the easiest regex to read (strings that don't (not contain hello) or (not contain world)), so let's add an & operator to express the concept more directly:

(.*hello.*)&(.*world.*)

enum Regex {
    ...
    /// Matches the intersection of two patterns.
    And(Box<Regex>, Box<Regex>),
}

$\begin{aligned} \text{Mᴇᴛᴀ} & \rarr \texttt{\&} \\[2pt] \text{Cʟᴀᴜsᴇ} & \rarr \text{Cʜᴜɴᴋ} \\ & \rarr \text{Cʟᴀᴜsᴇ} \; \texttt{\&} \; \text{Cʜᴜɴᴋ} \\[2pt] \text{Rᴇɢᴇx} & \rarr \text{Cʟᴀᴜsᴇ} \\ & \rarr \text{Rᴇɢᴇx} \; \texttt{|} \; \text{Cʟᴀᴜsᴇ} \\ \end{aligned}$

fn regex(pattern: &str) -> IResult<&str, Regex> {
    ...
    let meta = r"\().*|&";
    ...
    let clause = separated_list1(char('&'), chunk)
        .map(|v| reduce(v, Regex::and));

    separated_list1(char('|'), clause)
        .map(|v| reduce(v, Regex::or))
        .parse(pattern)
}

We don't even need a new matcher,

impl Regex {
    /// Compile a regular expression.
    fn matcher(&self) -> Box<dyn Matcher> {
        match self {
            ...
            Self::And(a, b) => Box::new(
                a.matcher().not()
                    .or(b.matcher().not())
                    .not()
            ),
        }
    }
}

but it's not hard to write one anyway.

/// Matcher for Regex::And.
struct And<A, B>(A, B);

impl<A: Matcher, B: Matcher> Matcher for And<A, B> {
    fn start(&mut self) -> bool {
        self.0.start() & self.1.start()
    }

    fn push(&mut self, c: char) -> bool {
        self.0.push(c) & self.1.push(c)
    }
}

Looking around

As far as I know, our ! and & operators are not features of any major regex implementation. But a similar feature called lookaround or positive/negative lookahead/lookbehind is found in many regex engines.

Feature	Syntax	Irregex equivalent
Positive lookahead	`A(?=B)C`	`A((B.*)&C)`
Negative lookahead	`A(?!B)C`	`A((!B.*)&C)`
Positive lookbehind	`A(?<=B)C`	`(A&(.*B))C`
Negative lookbehind	`A(?<!B)C`	`(A&(!.*B))C`

As far as I know, this feature is only provided by regex engines that have worst-case exponential complexity. The two major worst-case linear regex engines that I'm aware of, RE2 and Rust's regex crate, don't support it.

As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.

— https://github.com/google/re2/issues/156

Notably, backreferences and arbitrary lookahead/lookbehind assertions are not provided. In return, regular expression searching provided by this package has excellent worst-case performance.

— https://github.com/rust-lang/regex/issues/127

But as this post shows, it is certainly possible to support lookaround (not backreferences!) in linear time. There are some unanswered questions (what about capture groups?), but I'm curious if this implementation strategy would be workable in a production-quality regex engine.

tavianator.com