3 min read

Your CS teacher is wrong about Regex

by Vihan Bhargava
  • Regex
  • Grammars

Basically everywhere you turn there's someone who is telling you not too use regex. While that does have a level of validity to it, I think it has come more towards a level of misunderstanding.

Basically everywhere you turn there’s someone who is telling you not too use regex. While that does[1] have a level of validity to it (read later on as I debunk this fallacy), I think it has come more towards a level of misunderstanding.

#What is Regex?

(you could skip this part but not recommended if you are not truly experienced with regex)

Regular expressions aren’t a horrible evil thing like they’re oh to commonly portrayed. They are merely another representation of the grammars you learned in CS class:

AAaAaΣn\begin{array}{rl} A \rightarrow& A a \mid A \\ a \rightarrow& \Sigma_n \end{array}

But regular expressions is not as much greek letters; therefore, you may have grown used to seeing regexes such as:

(^|\b)(([d]?\()?[QqTt;.Oo><^*~e'ಠov$\-+!?`]([o.^_m-]+|╭╮)[QqTt;.Oo><^*~e'ಠov$\-+!?`](\)[b]?)?|([IlXxDd)(<\[#LCcOoq\/\\*][-]?['^"]?[;:8B|][<]?|[>]?[;:8B|Xx]['^"]?[-]?[7IlXxD)(>\]3#|LCcOoPp\/\\*])|[.\\\/]+[om][.\\\/]+|(¯\\*_\(ツ\)_\/¯)|\( ͡° ͜ʖ ͡°\)|[<>]?[}\])]:\(?[:o]\)|\([:o]\):[(\[{][<>]?|@--}--|<[\/\\]?3|\([╯」][°゜][□ロ][°゜][)\)][╯」](︵\ ┻━┻)?)(\b|$)

HALP WAT
HALP WAT

Fun Fact: this is a regex I wrote to match most emoticons

I will not deny such regexes are definitely daunting and look like a giant hack, but regexes are not a hack. In fact your regex engine can probably emulate any context-free grammar[2]. I don’t mean this theoretically even, I mean quite nicely.

#Welcome to 21st century regex

Okay so I’m going to take a detour and introduce to you how I think you should write regular expressions.


First let’s get vocab out of the way:

  • (?(DEFINE)...) this is a section the regex overlords have decided to take upon themselves to have completely ignored (mean I know right?).

In this context: definitions section

  • (?<...>...)[3] a named capture group, like (...).

In this context: a function

  • (?&...) recurses a named capture group, basically necessary without (?R) magic.

In this context: calls a function


So here’s your new string parsing regex:

(?(DEFINE)
     (?<STRING>
          "(?&BODY)"
     )

     (?<BODY>
          (
              (?&ESCAPE) | [^"]
          )*
     )

     (?<ESCAPE>
          \\.
     )
)

(?&STRING)

Anyone remotely familiar with grammars should instantly find this recognizable. Let me break it down:

My editing skills are way out of your league, I
know
My editing skills are way out of your league, I know

this seems familiar…

N,Σ,R,S\left\langle N,\Sigma, R, S \right\rangle

sed and grep’s expressions aren’t the bext of regex… this is.

#So What?

This is our next big question. So you can emulate “context-free grammars” but why does that redeem Regex?

Well this is exactly why I created this post.

You can’t parse [X]HTML with regex

source

and everytime I dispute them someone responds with something along the lines of.

You shouldn’t use regex to parse [X]HTML

which I also call BS on (to an extent).

#The XHTML myth

Did you say you can’t parse XHTML in regex?[4]

(?(DEFINE)
  (?<XHTML>
    (?&comment)*\s*
    (?&DOCTYPE)
    (?&comment)*\s*
    (?&HTML)
    (?&comment)*\s*
  )

  # HTML element
  (?<HTML>
    <html(?&attrs)?>(?&content)<\/html>
  )

  # Match content
  (?<content>
    \s*
    (?:
      ((?&tag) | [^<>]+)\s*
    )*
  )

  # General tag
  (?<tag>
     <((?&tagname))(?&attrs)?\s*(?:
       \/>|
       >\s*(?&content)\s*
       <\/\g'-1'>
     )|(?&comment)
  )

  # Attributes
  (?<attrs>\s+
    # The name
    (?&keyword) (
      \s*=\s*
      (?:
        (?&keyword)|(?&string)
      )
    )?
    (?&attrs)?
  )

  (?<string>
    "(?:\\.|.)+?"|
    '(?:\\.|.)+?'
  )

  (?<comment>
    \s*<!--(.+?)-->\s*
  )

  # Match keyword
  (?<keyword>[^\s\/>"'=]+)
  # Match tag name
  (?<tagname>(?!xml)[A-Za-z_][A-Za-z\d_.-]*)

  # DOCTYPE expression
  (?<DOCTYPE>
     <!doctype\s+x?html\s*
       # Match the PUBLIC "foo" stuff I have
       # no idea what even means
       (public\s*(?&string))?
       # Series of random strings
       (\s+(?&string))*
     >
  )
)

^\s*(?&XHTML)\s*$

I have done the most blasphemous thing one can and make a regex to parse XHTML[4:1].

Go ahead, try it out regex101 link for those interested.

The point here however is that is works, and I’m using regex, only regex. Sure some parts may seem complicated but anyone who is knowledgeable with the flavor I used (PCRE) can decipher it.

#Conclusion

So for anyone on the internet who claims about how regex only can perform “regular grammars”, please slam this article in their face.

Today’s regular expression are not regular grammars. I understand they do share a word, but we are not in 1950s anymore and I think it’s time we moved on to accepting regexs can do more than match a file path.

The point of this article was not to insult people who question the supremacy and power versatility of regular expressions but much rather to disillusion the common misconception around regular expressions.

So the next time you see someone give a regular expression solution on Stack Overflow… please don’t kill them.

One last thing, is that you do not need fancy PCRE regular expressions. Sure they have the nicest syntax but a whole range of flavors have extended mode (comments + free whitespace). The (?(DEFINE)...) block is also not too hard to emulate along and group recursion is supported with (?0) or \g<0> in some flavors.


If you have any questions doesn’t hesitate to leave a comment. If you think any part of this article is utter BS, leave a comment about that too.


  1. We really need to differentiate between “can” and “should”a ↩︎

  2. i.e. what the programming language you like probably is parsed with. ↩︎

  3. If you didn’t know this, consider yourself enlightened ↩︎

  4. May not perfectly follow spec but can be trivially modified to fit. ↩︎ ↩︎