Regular Expression Tips for People That Hate Regular Expressions

I’ve been writing regular expressions (regex) for years and would consider myself pretty good at them by now (I decline to provide references). Regex is commonly seen as cryptic and painfully terse (which are not invalid points), but it remains one of the best ways to handle complex pattern matching, so you’ll probably run into it at least a few times in your career.

Even if you really despise writing regular expressions, here are some tips that will help to make that process as painless as possible!

Use Regular Expressions Only Where Appropriate

I think this is one of the biggest sticking points people run into when writing regular expressions and the reason the infamous “two problems” quote rings true for so many people. Although regular expressions feel flexible enough to be suitable for just about any text matching problem, there’s a limit to what problems they can actually solve. I’m not smart enough to explain it myself, so I’ll share this wonderful answer from Stack Overflow responding to a question about parsing XHTML:

HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse XML with a regular expression.

Community – RegEx match open tags except XHTML self-contained tags

This is worth diving into further if you’re interested, but at the very least it’s important to know that there is a logical limit for what regular expressions are capable of.

It’s hard to clearly describe what makes regular expressions “appropriate” for a particular problem since every situation has its own subtleties to contend with, but in general I try to only use regular expressions for situations where other approaches simply don’t work as well or have some specific drawback. No matter how much I like regular expressions, I’m happy to admit that they can quickly complicate things if used too liberally!

Mix in Other Processing Techniques

Ask yourself: even if regular expressions would be practical here, does it all need to be done by a regular expression? I’ve personally come across so many situations where I’ve written a complex regex only to realise that I could have done a first pass on the data to get the main chunks out and work on each part separately! If you can use a language/framework feature that is better suited for solving that part of the problem, do it!

Really, this is a continuation of the first tip; if you can avoid writing regular expressions by using a different approach, see if it is better suited!
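As a rough sketch of the “first pass” idea, assume a made-up pipe-delimited record such as 2021-03-04|C12345|19.99 (the record layout and the RecordParser and IsValidRecord names are invented for this illustration). Rather than one big pattern for the whole line, plain splitting and the framework’s own parsers handle most of it, and only one field is left for a small regex:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class RecordParser
{
    // Hypothetical record layout: date|customer number|amount
    public static bool IsValidRecord(string line)
    {
        // First pass: plain string splitting, no regex needed.
        var parts = line.Split('|');
        if (parts.Length != 3)
        {
            return false;
        }

        // Dates and decimals already have dedicated framework parsers.
        var isDate = DateTime.TryParseExact(parts[0], "yyyy-MM-dd",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out _);
        var isAmount = decimal.TryParse(parts[2], NumberStyles.Number,
            CultureInfo.InvariantCulture, out _);

        // Only the customer number is left for a (now very small) pattern.
        var isCustomerNumber = Regex.IsMatch(parts[1], @"^[Cc]\d{5}$");

        return isDate && isAmount && isCustomerNumber;
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(RecordParser.IsValidRecord("2021-03-04|C12345|19.99")); // True
        Console.WriteLine(RecordParser.IsValidRecord("2021-13-04|C12345|19.99")); // False (month 13)
    }
}
```

Whether this beats a single pattern depends on your data, but each check can now fail (and be debugged) on its own.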

Split up Your Regular Expressions

Even if you’re working on a problem that you feel would be best handled by regular expressions, consider breaking down your patterns into smaller chunks that each handle a distinct part of the input. I’m sure I’m not the only one who has seen expressions that have spilled off the side of the screen, but when that happens it’s often because they’re trying to do too much at once.

Here’s an example for you. Imagine we have the following data:

C12345
ABCDEFGHIJ
ABC999

Let’s say that line 1 is a customer number (because it starts with a C and has 5 digits after it), line 2 is a product number (because it only contains letters and is 10 characters long) and line 3 is a transaction number (because it contains 3 letters followed by 3 digits).

Now imagine you come across the following code (written in C#) to verify whether the input the user gave us is one of the three available valid formats:

var customerOrProductOrTransactionNumberPattern = @"^[Cc]\d{5}$|^[A-Za-z]{10}$|^[A-Za-z]{3}\d{3}$";

foreach (var inputValue in inputList)
{
    if (Regex.IsMatch(inputValue, customerOrProductOrTransactionNumberPattern))
    {
        Console.WriteLine($"{inputValue} matches");
    }
    else
    {
        Console.WriteLine($"{inputValue} doesn't match");
    }
}

Does it do the job? Yes. Does it do it in a way that is clear to anyone reading it? Not really!

Let’s make some changes:

var customerNumberPattern = @"^[Cc]\d{5}$";
var productNumberPattern = @"^[A-Za-z]{10}$";
var transactionNumberPattern = @"^[A-Za-z]{3}\d{3}$";

foreach (var inputValue in inputList)
{
    var isCustomerNumber = Regex.IsMatch(inputValue, customerNumberPattern);
    var isProductNumber = Regex.IsMatch(inputValue, productNumberPattern);
    var isTransactionNumber = Regex.IsMatch(inputValue, transactionNumberPattern);

    if (isCustomerNumber || isProductNumber || isTransactionNumber)
    {
        Console.WriteLine($"{inputValue} matches");
    }
    else
    {
        Console.WriteLine($"{inputValue} doesn't match");
    }
}

Sure, we’ve increased the total line count, but this approach gives us a number of benefits:

  • Changes are easier to make to individual patterns
  • Version control changes are easier to track
  • Debugging is easier since you can see what the individual match results are
  • It’s easier to add logic based on individual match results

Note: You might think that the above example is contrived, but I’ve honestly seen a lot worse running in production.
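To illustrate the last benefit in the list, here is one possible sketch of logic built on the individual match results: working out which kind of number we were given. The NumberKind enum and the NumberClassifier.Classify method are names I’ve invented for this example, and the patterns are the three anchored ones from the data description above.

```csharp
using System;
using System.Text.RegularExpressions;

enum NumberKind { Unknown, Customer, Product, Transaction }

static class NumberClassifier
{
    const string CustomerNumberPattern = @"^[Cc]\d{5}$";
    const string ProductNumberPattern = @"^[A-Za-z]{10}$";
    const string TransactionNumberPattern = @"^[A-Za-z]{3}\d{3}$";

    // Each pattern is checked on its own, so every result can drive its own branch.
    public static NumberKind Classify(string inputValue)
    {
        if (Regex.IsMatch(inputValue, CustomerNumberPattern)) return NumberKind.Customer;
        if (Regex.IsMatch(inputValue, ProductNumberPattern)) return NumberKind.Product;
        if (Regex.IsMatch(inputValue, TransactionNumberPattern)) return NumberKind.Transaction;
        return NumberKind.Unknown;
    }
}

static class Program
{
    static void Main()
    {
        foreach (var inputValue in new[] { "C12345", "ABCDEFGHIJ", "ABC999", "12345" })
        {
            Console.WriteLine($"{inputValue}: {NumberClassifier.Classify(inputValue)}");
        }
    }
}
```

With the single combined pattern you would only know that something matched, not which format it was.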

Name Patterns Descriptively

This follows the usual programming advice of “name things well”, but since regular expressions already have a reputation for being unintelligible, you will be doing everyone a favour if you give patterns clear names.

As you can see in the example for “Split up Your Regular Expressions”, patterns (especially large and complex ones) are worth naming as accurately as possible. Which of the below examples is clearer?

var customerOrProductOrTransactionNumberPattern = @"^[Cc]\d{5}$|^[A-Za-z]{10}$|^[A-Za-z]{3}\d{3}$";
var pattern = @"^[Cc]\d{5}$|^[A-Za-z]{10}$|^[A-Za-z]{3}\d{3}$";

Regular expressions also have a bit of a discoverability problem that makes them hard to parse without context; regex will only tell you the pattern of data it is looking for, but not what that pattern actually represents!

Use Comments Where Needed

In addition to naming your regex pattern constants well, many (if not most) languages that support regular expressions also support some form of regex comments; C# for example has some excellent documentation on in-line comments.
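For a feel of what that looks like in C#, here is a minimal sketch showing the two comment styles I’m aware of in .NET: end-of-line # comments (which need RegexOptions.IgnorePatternWhitespace) and (?# ... ) inline comments. The CommentedPatterns class and the meanings given in the comments are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentedPatterns
{
    // With IgnorePatternWhitespace, unescaped whitespace in the pattern is
    // ignored and '#' starts a comment that runs to the end of the line.
    public const string TransactionNumber = @"
        ^
        [A-Za-z]{3}   # department code, invented for this example
        \d{3}         # running number within that department, also invented
        $";

    public static bool IsTransactionNumber(string inputValue)
    {
        return Regex.IsMatch(inputValue, TransactionNumber,
            RegexOptions.IgnorePatternWhitespace);
    }

    // (?# ... ) is an inline comment and does not need any options.
    public static bool IsCustomerNumber(string inputValue)
    {
        return Regex.IsMatch(inputValue, @"^[Cc](?#customer prefix)\d{5}$");
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(CommentedPatterns.IsTransactionNumber("ABC999")); // True
        Console.WriteLine(CommentedPatterns.IsTransactionNumber("ABC99"));  // False
        Console.WriteLine(CommentedPatterns.IsCustomerNumber("C12345"));    // True
    }
}
```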

This tip comes with a bit of a caveat though; if you’re at a point where you’ve got to add comments to a regular expression for it to be comprehensible, it’s worth considering whether regex is the best approach. If it is, make sure to follow the usual rules about comments such as “explain why, not how”.

Encapsulate Your Regex Processing

Encapsulating your regex processing inside of functions/methods is a great way of documenting what it is you’re trying to do with the pattern (through the name of the function) while hiding the complexity inside the function itself so that it doesn’t spill out into the calling code.

Let’s look at an example:

static bool IsCustomerNumber(string inputValue)
{
    var customerNumberPattern = @"^[Cc]\d{5}$"; //Imagine we're loading this from config
    return Regex.IsMatch(inputValue, customerNumberPattern);
}

With the above we don’t need to include any pattern setting/loading logic in the calling code and actually don’t have to worry about regex at all! You’ve also got the usual encapsulation benefit of being able to add additional checking logic without modifying the calling code.
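As a small sketch of that last point, suppose we later discover that null values sometimes reach the method (Regex.IsMatch throws an ArgumentNullException for a null input). The guard below is an assumption about what such a follow-up change might look like, and the CustomerNumbers class name is made up; the point is that the calling code stays exactly the same.

```csharp
using System;
using System.Text.RegularExpressions;

static class CustomerNumbers
{
    public static bool IsCustomerNumber(string inputValue)
    {
        // Extra check added after the fact; callers did not have to change.
        if (inputValue == null)
        {
            return false;
        }

        var customerNumberPattern = @"^[Cc]\d{5}$";
        return Regex.IsMatch(inputValue, customerNumberPattern);
    }
}

static class Program
{
    static void Main()
    {
        // The calling code never mentions regex at all.
        foreach (var inputValue in new[] { "C12345", "ABC999", null })
        {
            var label = inputValue ?? "(null)";
            Console.WriteLine($"{label}: {CustomerNumbers.IsCustomerNumber(inputValue)}");
        }
    }
}
```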

Write Unit Tests

If possible, write unit tests to validate your regular expressions (this is best paired with the above tip about encapsulation since it gives you a clear interface to work against). Unit tests not only help you verify that any changes you make to the patterns still follow the existing intentions, they also give you a very helpful list of examples that should and shouldn’t match. Win-win!
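Here is a minimal, framework-free sketch of what such tests could look like for the customer number check. In a real project you would most likely use a test framework such as xUnit, NUnit or MSTest; I’ve avoided them here only so the example has no dependencies. The CustomerNumbers class name and the list of cases are my own assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class CustomerNumbers
{
    public static bool IsCustomerNumber(string inputValue)
    {
        return Regex.IsMatch(inputValue, @"^[Cc]\d{5}$");
    }
}

static class Program
{
    static int Main()
    {
        // The case list doubles as documentation of what should and shouldn't match.
        var cases = new (string Input, bool Expected)[]
        {
            ("C12345", true),
            ("c12345", true),
            ("C1234", false),    // too few digits
            ("C123456", false),  // too many digits
            ("D12345", false),   // wrong prefix
            ("", false),         // empty input
        };

        var failures = 0;
        foreach (var (input, expected) in cases)
        {
            var actual = CustomerNumbers.IsCustomerNumber(input);
            if (actual != expected)
            {
                failures++;
                Console.WriteLine($"FAIL: \"{input}\" expected {expected} but got {actual}");
            }
        }

        Console.WriteLine(failures == 0 ? "All cases passed" : $"{failures} case(s) failed");
        return failures; // non-zero exit code signals failure
    }
}
```

If someone later loosens or tightens the pattern, the failing cases show exactly which intention was broken.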
