Webpusher

Mark Lennox

a regrettable tendency to Javascript

Mark Lennox Javascript, C#, Python, machine learning, web, devops, mobile

Regexp - named groups FTW!

7th January, 2020

If you think regex is the solution to your problem, now you have two problems...

Everybody loves regex, or they think they do...

You dive into a sea of coding and emerge at the other side with some regex matches in your teeth - success! You lie on your metaphorical beach warmed by the sun of your own genius. But, then the edge cases start washing up beside you, stinking the place up. By the time you've thrown them all back in the sea your hands are filthy and the sun has crept low to the horizon where it sulks behind some clouds. You're cold and you realise regular expressions were probably not the right approach.

But they are cool!

What are they good for?

Regular expressions are actually pretty good for matching patterns in data that follows a well-defined format - IP addresses and email addresses being prime examples.

However, they are often abused to 'quickly' parse JSON or HTML. Don't be tempted, that way lies madness! Use a real parser instead.
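To make the point concrete, here is a small sketch of what goes wrong. The payload and the naive pattern are made up for illustration; the only real API used is the built-in JSON.parse.

```javascript
// A made-up JSON payload whose "note" value contains escaped quotes and a brace.
const payload = '{"iban": "GB98MIDL07009312345678", "note": "contains \\"quotes\\" and a } brace"}';

// A naive regex stops at the first double quote it meets, escaped or not.
const naive = /"note":\s*"([^"]*)"/.exec(payload);
console.log(naive[1]);
// 'contains \\' (the value is cut off at the first escaped quote)

// The real parser understands escaping and nesting.
const parsed = JSON.parse(payload);
console.log(parsed.note);
// 'contains "quotes" and a } brace'
```

You could keep patching the pattern to cope with escapes, but at that point you are writing a worse parser by hand.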

Since you should only be using your regex on well-formatted data, you may as well make life easier for yourself and take advantage of named groups.

What in the name of context-free grammars are 'named groups'?

Well, first let's remind ourselves what matched groups are with some examples - we'll work up to named groups.

We'll use the text below as input in each example.

const source = `GB98MIDL07009312345678`;

This is actually one of the example IBAN codes provided on the IBAN Wikipedia page, so if you are a programmer at a major blue chip bank who came here from Google - welcome!

First with no groups

Let's say you have the following regular expression:

// will match UK bank account IBAN
const noGroupMatcher = new RegExp(/[a-zA-Z]{2}\d{2}[a-zA-Z]{4}\d{6}\d{8}/);

If we execute this regular expression using our source text we'll get the result below

const result = noGroupMatcher.exec(source)
// ['GB98MIDL07009312345678', index: 0, input: 'GB98MIDL07009312345678', groups: undefined]

The result looks a bit funky, but it is just an array with some extra properties: element 0 holds the matched text, and it sits alongside index, input, and groups.

With no groups provided in the regular expression there are no matched groups in the output, and as expected we match the whole input string, so the matched data we want is

result[0]
// 'GB98MIDL07009312345678'

Ok, so we can verify that the input matches the format of a UK bank account IBAN - that is useful! Your manager in that blue-chip bank is probably nodding at you in an appreciative and thoughtful way by now - good job!
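One caveat before the bonus arrives: without anchors, the pattern will also match an IBAN-shaped run buried inside a longer string. Here is a minimal sketch of a stricter check, assuming you want to validate the whole input. The helper name looksLikeUkIban is made up for this example, and it only checks the shape of the string, not the check digits.

```javascript
const source = 'GB98MIDL07009312345678';

// The original pattern, unanchored: it matches anywhere inside the input.
const unanchoredMatcher = /[a-zA-Z]{2}\d{2}[a-zA-Z]{4}\d{6}\d{8}/;

// The same pattern with ^ and $ so nothing may precede or follow the IBAN.
const anchoredMatcher = /^[a-zA-Z]{2}\d{2}[a-zA-Z]{4}\d{6}\d{8}$/;

// Hypothetical helper: true only when the entire string has the UK IBAN shape.
function looksLikeUkIban(text) {
  return anchoredMatcher.test(text);
}

console.log(unanchoredMatcher.test(`xx${source}yy`));
// true (the IBAN is found in the middle of the junk)
console.log(looksLikeUkIban(`xx${source}yy`));
// false
console.log(looksLikeUkIban(source));
// true
```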

but wait! there's more!

Now with groups

We can tell regexp to 'capture' groups of matched elements using brackets - updating the regular expression above, we add brackets around each sub pattern. This indicates we want to pull out the individual values matched.

const groupedMatcher = new RegExp(/([a-zA-Z]{2})(\d{2})([a-zA-Z]{4})(\d{6})(\d{8})/);

If we execute this regular expression we'll see a different result

const result = groupedMatcher.exec(source)
// ['GB98MIDL07009312345678', 'GB', '98', 'MIDL', '070093', '12345678', index: 0, input: 'GB98MIDL07009312345678', groups: undefined]

Great! We can pull out each separate piece of information from the IBAN - this is beginning to be useful! It is much easier to see the country code, check digits, bank code, branch sort code, and bank account number. The drawback is that we need to access them by array index:

result[1]
// 'GB'
result[4]
// '070093'
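If you are stuck with positional groups, array destructuring at least gives the values names at the point of use. This is only a sketch; the variable names are my own labels, and they still depend on you remembering the order of the brackets in the pattern.

```javascript
const source = 'GB98MIDL07009312345678';
const groupedMatcher = /([a-zA-Z]{2})(\d{2})([a-zA-Z]{4})(\d{6})(\d{8})/;

const match = groupedMatcher.exec(source);

// exec returns null when nothing matches, so fall back to an empty array.
// The leading comma skips element 0, which is the whole matched string.
const [, countryCode, checkDigits, bankCode, branchSortCode, bankAccountNumber] = match || [];

console.log(countryCode, branchSortCode);
// GB 070093
```

Reorder the groups in the regex and these names silently point at the wrong values, which is exactly the problem named groups solve.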

Your manager is probably emailing the managers of other teams boasting about the shit hot programmer he hired and hinting at the huge bonus he'll get this Christmas, not to mention his manager nodding at you both in an appreciative way, and perhaps even stroking his chin! Oh yes, you're cooking now!

But why isn't the groups property assigned a value - we indicated we wanted to capture groups, didn't we? Well, this property is actually only assigned a value when we use named groups, so let's have a look at that now.

Named groups

We've pulled out lots of information about the IBAN, but using named groups we can get a lot more value from regular expressions.

We take our group matcher and update it to add a name for each set of brackets that capture the patterns they enclose.

const namedGroupMatcher = new RegExp(/(?<countryCode>[a-zA-Z]{2})(?<checkDigits>\d{2})(?<bankCode>[a-zA-Z]{4})(?<branchSortCode>\d{6})(?<bankAccountNumber>\d{8})/);

Now, when we execute this against the IBAN the result is very useful indeed.

const result = namedGroupMatcher.exec(source)
// [
//   'GB98MIDL07009312345678', 'GB', '98', 'MIDL', '070093', '12345678',
//   index: 0,
//   input: 'GB98MIDL07009312345678',
//   groups: {
//       countryCode: 'GB',
//       checkDigits: '98',
//       bankCode: 'MIDL',
//       branchSortCode: '070093',
//       bankAccountNumber: '12345678'
//   }
// ]

The result.groups object contains the properly named data values we have parsed from the IBAN. We can reference these directly instead of trying to remember the position of a piece of information in an array. The names also give context to the data, improving your coding experience and making maintenance easier.

result.groups.countryCode
// 'GB'
result.groups.branchSortCode
// '070093'
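Named groups also play nicely with object destructuring and with replacement strings, where $<name> refers to a group by name. A short sketch, reusing the pattern from above; the spaced-out output format is just something I picked for the example.

```javascript
const source = 'GB98MIDL07009312345678';
const namedGroupMatcher = /(?<countryCode>[a-zA-Z]{2})(?<checkDigits>\d{2})(?<bankCode>[a-zA-Z]{4})(?<branchSortCode>\d{6})(?<bankAccountNumber>\d{8})/;

const match = namedGroupMatcher.exec(source);

// exec returns null when nothing matches, so fall back to an empty object.
const { countryCode, bankAccountNumber } = match ? match.groups : {};
console.log(countryCode, bankAccountNumber);
// GB 12345678

// In String.prototype.replace, $<name> inserts the text captured by that group.
const formatted = source.replace(
  namedGroupMatcher,
  '$<countryCode>$<checkDigits> $<bankCode> $<branchSortCode> $<bankAccountNumber>'
);
console.log(formatted);
// GB98 MIDL 070093 12345678
```

Unlike positional destructuring, this keeps working if someone reorders the groups or adds a new one in the middle of the pattern.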

This performance will no doubt earn you the key to the executive washroom, the loan of the CEO's car, and a week in the Bahamas. You'll never want for Swingline staplers, and you'll have a corner office on the top floor. Your original manager has been promoted to president and given you the keys to the city.

Attaboy!

Congratulations, galaxy brain

Now that you have tamed the power of regular expressions, use them wisely. Remember:

with great power comes great responsibility