Friday, January 14, 2011

How to not match a string with regular expressions

Regular expressions are a powerful way to match strings and find patterns in text, but they take some time to learn. For instance, consider extracting all quoted strings from this text:
a "sample" text "fragment".

A first attempt might look like this:
".*"

But since the * quantifier is greedy by default, it consumes as much as it can, and the expression matches everything from the first quote to the last:
"sample" text "fragment"

The fix is to replace the any-character pattern (.) with one that matches any character except " ([^"]):
"[^"]*"

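The difference between the two expressions can be checked with a short script. This is a minimal sketch that assumes Python's re module; the post itself does not name a regex flavor, and the variable names are only illustrative.

```python
import re

text = 'a "sample" text "fragment".'

# Greedy: .* runs on to the last quote in the text, so there is one long match.
greedy = re.findall(r'".*"', text)

# Negated character class: [^"]* cannot cross a quote, so each quoted
# string is matched on its own.
negated = re.findall(r'"[^"]*"', text)

print(greedy)   # ['"sample" text "fragment"']
print(negated)  # ['"sample"', '"fragment"']
```
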
This solution works, but what if the delimiter is a string rather than a single character? For example:
a <!--sample--> text <!--fragment-->.

Here "any character except" is no longer enough, because a character class can only exclude single characters. What you need is "any character except one that is followed by" a given string, and this is called negative lookahead.

To match any character except one that is followed by --> you can use the following expression:
(.(?!-->))*

Applied to the sample text, this matches everything except the e in sample and the t in fragment, because each of those characters is directly followed by -->. If we want the entire inner text matched, we need to add one more . to consume the excluded character:
(.(?!-->))*.
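To see the effect of the trailing dot, both expressions can be tried on a piece of the sample text. This is a small sketch that assumes Python's re module; the string inner is a hypothetical fragment cut from the sample, starting just after the opening delimiter.

```python
import re

inner = 'sample--> text'

# Each . is accepted only if the text right after it is not -->,
# so the repetition stops just before the final e.
without_last = re.match(r'(.(?!-->))*', inner).group(0)

# The trailing . consumes the one character the lookahead rejected.
with_last = re.match(r'(.(?!-->))*.', inner).group(0)

print(without_last)  # sampl
print(with_last)     # sample
```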

This matches the entire inner text, so now we can add the delimiters:
<!--(.(?!-->))*.-->
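A minimal sketch of the complete expression run against the sample text, assuming Python's re module. It also illustrates one edge case: because the trailing dot requires at least one character between the delimiters, an empty comment is not matched. The variant shown at the end, which places the lookahead before the dot, is a commonly used alternative and is not part of the original post.

```python
import re

text = 'a <!--sample--> text <!--fragment-->.'

# The expression from the post.
original = re.compile(r'<!--(.(?!-->))*.-->')
found = [m.group(0) for m in original.finditer(text)]
print(found)  # ['<!--sample-->', '<!--fragment-->']

# Edge case: the trailing . needs at least one character of inner text.
print(original.search('<!---->'))  # None

# Variant (not from the post): test the lookahead before consuming each character.
variant = re.compile(r'<!--((?!-->).)*-->')
print(variant.search('<!---->').group(0))  # <!---->
```

Note that . does not match a newline by default in Python, so a comment spanning several lines would need the re.DOTALL flag with either expression.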

Derek Slager has an online regular expression tester.

7 comments:

Anonymous said...

That was an awesome read. You discover something new every day.

Anonymous said...

Thank you for giving us the opportunity to see how you work and learn so much!

Anonymous said...

Been looking for this article for a long time and finally found it here. Thanks for sharing this post. Appreciated!

Anonymous said...

I have been searching for this information and finally found it. Thanks!

Anonymous said...

Nice one, might come in handy in the near future

Anonymous said...

very interesting, thanks

Anonymous said...

Excellent article, a great deal of valuable information.