Parser Combinators Beat Regexes

mooreds | 119 points

Parser combinators are one of those ideas that are really powerful in practice but will never be mainstream, because they had the bad fortune of coming from the functional programming world. It's a shame too, because parser combinators are what made parsers click for me.

Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc.) to combine those simple parsers into larger, more complex parsers.

You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.
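
A minimal sketch of that in Haskell, as if written from memory (pChar, orElse, andThen, and oneOrMany are illustrative names, not from any particular package):

  -- a parser consumes input and maybe yields a value plus the rest
  newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

  -- the simplest parser: read one specific character
  pChar :: Char -> Parser Char
  pChar c = Parser $ \s -> case s of
    (x:xs) | x == c -> Just (c, xs)
    _               -> Nothing

  -- "or": try the first parser; if it fails, try the second
  orElse :: Parser a -> Parser a -> Parser a
  orElse p q = Parser $ \s -> maybe (runParser q s) Just (runParser p s)

  -- "and": run two parsers in sequence, pairing their results
  andThen :: Parser a -> Parser b -> Parser (a, b)
  andThen p q = Parser $ \s -> do
    (a, s')  <- runParser p s
    (b, s'') <- runParser q s'
    Just ((a, b), s'')

  -- "one_or_many": apply a parser repeatedly, at least once
  oneOrMany :: Parser a -> Parser [a]
  oneOrMany p = Parser $ \s -> do
    (a, s') <- runParser p s
    case runParser (oneOrMany p) s' of
      Just (as, s'') -> Just (a : as, s'')
      Nothing        -> Just ([a], s')

  -- runParser (oneOrMany (pChar 'a' `orElse` pChar 'b')) "abba!"
  --   == Just ("abba", "!")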

yen223 | 3 days ago

There are numerous posts, comments, and articles about when to use a regex versus a parser. The general rule is this:

If you need to search a string for something, needle(s) in a haystack, a regex is probably a better fit than a parser. If you need anything more intelligent than a list of search results, you probably want a full formal parser.

Most languages allow a form of nested regex that increases search precision: a method that applies a regex hands each matching string to a function that applies a further regex to it. This is why a regex is probably enough when the need is primitive. A full parser has a tremendously higher cost, considering the lexer and tokenizer plus rules, but it's so much more intelligent that the two aren't even comparable.
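
Sketched in Haskell (assuming the regex-tdfa package; the patterns and helper names are illustrative):

  import Text.Regex.TDFA (getAllTextMatches, (=~))

  -- outer regex pulls out candidate matches...
  urls :: String -> [String]
  urls s = getAllTextMatches (s =~ "https?://[^ ]+")

  -- ...and an inner regex is applied to each match for extra precision
  secureUrls :: String -> [String]
  secureUrls = filter (=~ "^https://") . urls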

austin-cheney | 3 days ago

> It uses the pcre-heavy Haskell library which in turn calls out to the system-wide pcre C library for actually compiling and running the regex.

That's not true regular expressions, but a backtracker with eclectic features.

The regex used in the regex solution is just:

  mul\((\d+),(\d+)\)

which doesn't require PCRE.

If Haskell happens to provide bindings to the POSIX regcomp/regexec stuff, that would be something to try instead. \d has to be replaced with [0-9], needless to say.
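
A sketch of what trying that might look like, assuming the regex-posix package (which binds the system regcomp/regexec):

  import Text.Regex.Posix ((=~))

  -- POSIX ERE has no \d, so [0-9] (or [[:digit:]]) stands in for it
  muls :: String -> [[String]]
  muls s = s =~ "mul\\(([0-9]+),([0-9]+)\\)"

  -- muls "xmul(2,4)y" == [["mul(2,4)","2","4"]]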

kazinator | 3 days ago

“In other languages, it would be considered overkill to write a full parser when a simple regex can do the same thing. In Haskell, writing a parser is no big deal. We just do it and move on with our lives.”

I see a long code file filled with comments and long paragraph-level explanations. I think I’d rather just learn and use regex.

arn3n | 3 days ago

Note that most implementations of both parser combinators and regexes can fail very badly (exponential time). Never use either on untrusted input unless you can prove your implementation stays linear.
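
To see that failure mode, here's a sketch using hand-rolled list-of-successes combinators (minimal made-up definitions, not any real library). The ambiguity of (a|aa)+ means the work roughly multiplies by 1.6 for every extra 'a':

  import Data.List (isPrefixOf)

  -- list-of-successes parser: returns every way to parse a prefix
  newtype P a = P { runP :: String -> [(a, String)] }

  lit :: String -> P String
  lit t = P $ \s -> [(t, drop (length t) s) | t `isPrefixOf` s]

  orElse :: P a -> P a -> P a
  orElse p q = P $ \s -> runP p s ++ runP q s

  andThen :: P a -> P b -> P (a, b)
  andThen p q = P $ \s ->
    [((a, b), s'') | (a, s') <- runP p s, (b, s'') <- runP q s']

  -- one or more repetitions, keeping every possible stopping point
  many1 :: P a -> P [a]
  many1 p = P $ \s ->
    [ (a : as, s'')
    | (a, s')   <- runP p s
    , (as, s'') <- runP (many1 p) s' ++ [([], s')] ]

  -- (a|aa)+ then 'b': on "aaa...ac" every split is explored, then fails
  blowup :: P ([String], String)
  blowup = many1 (lit "a" `orElse` lit "aa") `andThen` lit "b"

  main :: IO ()
  main = print (length (runP blowup (replicate 30 'a' ++ "c")))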

o11c | 3 days ago

Parser combinators work well for big, messy problems, not just the regex-sized ones; I used a custom library of combinators to write the Fortran parser for LLVM. They're great for error recovery and for disambiguating tricky situations, of which Fortran has more than its share; I ended up parsing cooked characters directly, without a tokenization phase.

One hard part is getting good performance when the same production is tried as part of multiple higher-level alternatives and backtracking takes place. You can get into trouble on bad inputs, like expressions with thousands of operands used in a context that isn't clear until the characters after the expression. But there are ways to solve that problem, so I recommend combinators if you need to write a parser.
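
One such fix (my example, not necessarily the one used in the Fortran parser) is left-factoring, so the shared production is parsed once rather than once per alternative; a sketch using Parsec with a toy grammar:

  import Text.Parsec
  import Text.Parsec.String (Parser)

  -- stand-in for an expensive production shared by two alternatives
  expr :: Parser String
  expr = many1 digit

  -- naive: when '!' is missing, `try` rewinds and expr is parsed again
  slow :: Parser (String, Char)
  slow = try ((,) <$> expr <*> char '!') <|> ((,) <$> expr <*> char '?')

  -- left-factored: expr is parsed exactly once, then we branch
  fast :: Parser (String, Char)
  fast = do
    e <- expr
    c <- char '!' <|> char '?'
    pure (e, c)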

pklausler | 2 days ago

I've been working a lot with parser combinators lately; they are a really nice compromise between grammar languages (I'd put regexes in this category, along with parser generators) and hand-rolled imperative parsers. The parser combinator library ends up being a DSL for describing grammars in your preferred language, but the boundary is fluid, and you can move back and forth between writing grammars and writing ordinary code.

The bold implementer could integrate regexes into their parser combinator library, and you could move between all three.
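
That integration can be small; a sketch assuming the regex-tdfa package, with the usual minimal Parser type:

  import Text.Regex.TDFA ((=~))

  newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

  -- lift a regex into a parser: succeed only if it matches at the
  -- current position, consuming exactly the matched text
  regexP :: String -> Parser String
  regexP pat = Parser $ \s ->
    case s =~ pat :: (String, String, String) of
      (pre, m, rest) | null pre && not (null m) -> Just (m, rest)
      _                                         -> Nothing

  -- runParser (regexP "[0-9]+") "123abc" == Just ("123","abc")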

maxbond | 3 days ago

A few days ago I mentioned my first parser [1], which used regex in the tokenization step, based on studying the parser video by Destroy All Software. I'm glad I learned it that way first, since it takes a lot of the pain out of lexing. I've now built many parsers, including accidentally re-inventing parser combinators when working in Erlang. Two days ago I built another parser that uses regex heavily for tokenization/some parsing steps. I never parse programming languages, so my programs' needs are quite different from the parsers talked about online!

https://news.ycombinator.com/item?id=43627698

Rendello | 2 days ago

You can also use simple string search functions for this Advent of Code problem.

I used this approach for implementing orthography rules in a stenography application (implementing rules like `artistic + ly = artistically`) and in the benchmarks I did, a few lines of searching for character indexes was an order of magnitude faster than regexes. Each of my functions is about 3-10 ish lines of simple code compared to a 1-line regex.

You do have to handle cases like the character not being found, but I've had zero issues with the code and, in terms of what actually executes, it is vastly simpler. This is why the saying goes that if you make something twice as fast, you may have done something clever, but if you make it 10x faster, you stopped doing something dumb.
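
One such rule might look like this in Haskell (the rule itself is a guess for illustration, not the actual steno logic):

  import Data.List (isSuffixOf)

  -- attach "ly" with plain suffix checks: artistic + ly = artistically
  attachLy :: String -> String
  attachLy w
    | "ic" `isSuffixOf` w = w ++ "ally"
    | otherwise           = w ++ "ly"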

henning | 3 days ago

Parser combinators are great until you need to parse something real, like CSV with embedded newlines and Excel quotes. That’s when you reach for the reliable trio: awk, duct tape, and prayer.

DadBase | 3 days ago

It makes sense: combinators can parse a strict superset of the formal languages that regexes can, according to Chomsky's hierarchy [0]. The classic example is balanced parentheses, sketched below.

[0]: https://en.wikipedia.org/wiki/Chomsky_hierarchy
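
A sketch of that example using Parsec:

  import Text.Parsec
  import Text.Parsec.String (Parser)

  -- zero or more balanced groups, each of which may nest
  balanced :: Parser ()
  balanced = skipMany (char '(' *> balanced <* char ')')

  -- parseTest (balanced <* eof) "(()())"  -- succeeds
  -- parseTest (balanced <* eof) "(()"     -- parse error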

kreig | 2 days ago

Functional programming languages are uniquely suitable for writing parsers: their whole feature set (discriminated unions, pattern matching, etc.) is very suited to turning text into ASTs.
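
For instance, a toy AST as a discriminated union, consumed by pattern matching (made-up types for illustration):

  -- one constructor per syntactic form...
  data Expr = Num Int | Add Expr Expr | Mul Expr Expr

  -- ...and pattern matching makes the tree walk read like the grammar
  eval :: Expr -> Int
  eval (Num n)   = n
  eval (Add a b) = eval a + eval b
  eval (Mul a b) = eval a * eval b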

That said, I'd say it's a pretty rare occasion to have that need, and other languages have great libraries for accomplishing the same. I'm partial to Chevrotain for JavaScript (https://chevrotain.io/docs/), which has a small DSL for describing grammars.

Parser combinators are both less efficient and less featureful than bespoke solutions, which is a problem when you're writing a parser for your custom language and you want to add extra features like resilience or rich error reporting.

To add a bit of snark, functional languages seem to be best suited for writing parsers/compilers for other functional programming languages, which is pretty telling of the inward-facing functional culture.

torginus | 3 days ago

This is a false dichotomy -- regexes and parsers both have their place, even when solving the same problem.

The troubles start when you try to solve the whole thing in one step, using just regular expressions or just parsers.

Regular expressions are good at tokenizing input (converting a stream of bytes into a stream of other things, e.g. picking out numbers, punctuation, keywords).

Parsers are good at identifying structure in a token stream.

Neither are good at evaluation. Leave that as its own step.

Applying this rule to the example in the article (Advent of Code 2024 Day 3), I would still use regular expressions to identify mul(\d+,\d+), do(), and don't(). I don't think I need a parser, because there is no extra structure beyond that token stream, and I would leave it up to the evaluator to track whether multiplication is enabled or not.
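
A sketch of that split, assuming the regex-tdfa package (the token type and helper names are made up):

  import Text.Regex.TDFA ((=~))

  data Tok = Mul Int Int | Do | Dont

  -- stage 1: a regex turns the raw input into a token stream
  tokenize :: String -> [Tok]
  tokenize s = map toTok (s =~ pat :: [[String]])
    where
      pat = "mul\\(([0-9]+),([0-9]+)\\)|do\\(\\)|don't\\(\\)"
      toTok ["do()", _, _]    = Do
      toTok ["don't()", _, _] = Dont
      toTok [_, a, b]         = Mul (read a) (read b)
      toTok _                 = error "unreachable: two capture groups"

  -- stage 2: no structure to recover, so the evaluator just folds over
  -- the tokens, tracking whether multiplication is enabled
  evalToks :: [Tok] -> Int
  evalToks = go True 0
    where
      go _  acc []             = acc
      go _  acc (Do   : ts)    = go True  acc ts
      go _  acc (Dont : ts)    = go False acc ts
      go on acc (Mul a b : ts) = go on (if on then acc + a * b else acc) ts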

asQuirreL | 3 days ago

Parser combinators will eventually generate a push-down automaton, while a regex will eventually generate a deterministic finite-state automaton. There's no contest: CPU-wise the DFA will parse the string faster, since it does just one transition per character, a single table lookup. In some cases the automaton might take more memory, though. (This also assumes you keep within the original regex scope and don't reach for library-specific extensions that allow backtracking.)

It's fine and fair to suggest that parser combinators are better for many use cases (first of all, they're clearly more powerful: they can parse any context-free language). But it's misleading to suggest they "beat" regexes -- they don't, not on regex turf.
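
To make "one transition per char" concrete (a hand-written DFA for the regular language (ab)*, purely illustrative):

  -- run a DFA: one transition per character, each a single function call
  runDFA :: (Int -> Char -> Int) -> Int -> (Int -> Bool) -> String -> Bool
  runDFA step start accept = accept . foldl step start

  -- recognizes (ab)*; state 2 is the dead state
  abStar :: String -> Bool
  abStar = runDFA step 0 (== 0)
    where
      step 0 'a' = 1
      step 1 'b' = 0
      step _ _   = 2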

virgilp | 3 days ago

Obligatory plug for Sprache (https://github.com/sprache/Sprache), a nice library I've been using whenever I've needed to create a little parser for something in C#. I'm not dogmatically opposed to using regexes; it's just that for me they feel quite clunky for many tasks and tend to grow legs and become harder to understand.

smcl | 3 days ago