An Introduction To Regular Expressions

Updated on October 27, 2020
By Shantanu Kulkarni

Introduction

System administrators, developers, QA engineers, and support engineers often need to find a particular pattern in files: a set of IP addresses belonging to a certain range, a range of timestamps, or groups of domain or subdomain names. They might also need to find a word spelled in a particular way, or to find possible typos in a file. This is where regular expressions come in.

Regular expressions are templates for matching patterns (or sometimes for excluding them). They provide a way to describe and parse text. This tutorial gives an overview of regular expressions without going into the particularities of any programming language. We will simply use egrep (equivalent to grep -E) to explain the concepts.

Regular Expressions

Regular expressions consist of two types of characters:

  • the regular literal characters and

  • the metacharacters

The metacharacters are what give regular expressions their power.

Consider the following country.txt file, where the first column is the country name, the second column is the population of the country, and the third column is the continent.

$ cat country.txt
India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Argentina,36955182,Latin America
Brazil,172860370,Latin America
Cameroon,15421937,Africa
Japan,126549976,Asia

Anchor Metacharacters

The first metacharacters we will discuss are ^ and $. They match the start and end of a line, respectively, and are called anchor metacharacters.

To find all the countries whose name starts with I, we use the expression:

$ egrep '^I' country.txt
India,1014003817,Asia
Italy,57634327,Europe

Or, to find all the countries whose continent name ends with e, we use:

$ egrep 'e$' country.txt
Italy,57634327,Europe
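
To see what an anchor changes, the following sketch compares an unanchored pattern with an anchored one. It uses grep -E, which is equivalent to egrep, and keeps the sample data in a shell variable (the variable names are only illustrative) so that it runs without country.txt.

```shell
# The sample data from country.txt, kept in a variable so the sketch is self-contained.
countries='India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Argentina,36955182,Latin America
Brazil,172860370,Latin America
Cameroon,15421937,Africa
Japan,126549976,Asia'

# Without an anchor, the A may appear anywhere on the line; -c counts matching lines.
unanchored=$(printf '%s\n' "$countries" | grep -Ec 'A')

# With ^, the A must be the first character of the line.
anchored=$(printf '%s\n' "$countries" | grep -E '^A')

echo "lines containing A: $unanchored"
echo "lines starting with A: $anchored"
```

Every line except the one for Italy contains an A somewhere, but only one line starts with it.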

The next metacharacter is the dot (.), which matches any single character. To match all the lines in which the country name is exactly 5 characters long:

$ egrep '^.....,' country.txt
India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Japan,126549976,Asia

How about finding all lines in which the country name starts with either I or J and is 5 characters long?

$ egrep '^[IJ]....,' country.txt
India,1014003817,Asia
Italy,57634327,Europe
Japan,126549976,Asia

[…] is called a character set or a character class. A character set matches exactly one character, which must be one of the characters listed inside it.

A ^ as the first character inside a character set negates it. The following example matches country names that are five characters long but do not start with either I or J.

$ egrep '^[^IJ]....,' country.txt
Yemen,1184300,Asia
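
A character set can also hold a range such as [A-C] or [0-9], which the later examples rely on. The sketch below is one way to try ranges and their negation on the sample data. It assumes grep -E (equivalent to egrep) and sets LC_ALL=C so that the range means exactly the ASCII letters A, B, and C. The variable names are illustrative.

```shell
# The sample data from country.txt, kept in a variable so the sketch is self-contained.
countries='India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Argentina,36955182,Latin America
Brazil,172860370,Latin America
Cameroon,15421937,Africa
Japan,126549976,Asia'

# Lines whose first character is A, B, or C.
first_three=$(printf '%s\n' "$countries" | LC_ALL=C grep -E '^[A-C]')

# Count of lines whose first character is anything except A, B, or C.
others=$(printf '%s\n' "$countries" | LC_ALL=C grep -Ec '^[^A-C]')

echo "$first_three"
echo "remaining lines: $others"
```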

The Grouping Metacharacter and the Alternation

To match all the lines containing Asia or Africa, we use the alternation metacharacter |:

$ egrep 'Asia|Africa' country.txt
India,1014003817,Asia
Yemen,1184300,Asia
Cameroon,15421937,Africa
Japan,126549976,Asia

The same can be done by factoring out the common leading A and trailing a, and grouping the alternatives with parentheses:

$ egrep 'A(si|fric)a' country.txt
India,1014003817,Asia
Yemen,1184300,Asia
Cameroon,15421937,Africa
Japan,126549976,Asia
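
Grouping also matters when alternation is combined with an anchor, because | separates everything on its left from everything on its right. The following sketch, run on a small made-up word list rather than country.txt, shows the difference. It assumes grep -E, which is equivalent to egrep.

```shell
# A made-up sample; only two of these lines start with India or Japan.
sample='Japan
New Japan
India
Old India'

# Without parentheses this reads as "^India" OR "Japan anywhere on the line".
loose=$(printf '%s\n' "$sample" | grep -Ec '^India|Japan')

# With parentheses the ^ anchor applies to both alternatives.
strict=$(printf '%s\n' "$sample" | grep -Ec '^(India|Japan)')

echo "without grouping: $loose matching lines"
echo "with grouping: $strict matching lines"
```

The ungrouped pattern also accepts New Japan, since the anchor belongs only to the first alternative.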

Quantifiers

Instead of writing

$ egrep '^[IJ]....,' country.txt

we can write

$ egrep '^[IJ].{4},' country.txt

where {} is called a quantifier. A quantifier determines how many times the element before it (here, the dot) must occur.

We can give a range too:

$ egrep '^[IJ].{4,6},' country.txt
India,1014003817,Asia
Italy,57634327,Europe
Japan,126549976,Asia

This matches country names that start with I or J followed by 4 to 6 more characters.
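
A quantifier can follow a character set as well as a dot, and the upper bound can be left open. As a sketch, the following finds the countries whose population has nine or more digits, that is, at least 100 million. It assumes grep -E (equivalent to egrep), uses cut only to trim the output to the country name, and the variable names are illustrative.

```shell
# The sample data from country.txt, kept in a variable so the sketch is self-contained.
countries='India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Argentina,36955182,Latin America
Brazil,172860370,Latin America
Cameroon,15421937,Africa
Japan,126549976,Asia'

# ,[0-9]{9,}, requires a comma, at least nine digits, and another comma.
# cut keeps only the first comma-separated field (the country name).
large=$(printf '%s\n' "$countries" | grep -E ',[0-9]{9,},' | cut -d, -f1)

echo "$large"
```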

There are some shortcuts available for the quantifiers. For example,

{0,1} is equivalent to ?

$ egrep '^ab{0,1}c$' filename

is the same as

$ egrep '^ab?c$' filename

{0,} is equivalent to *

$ egrep '^ab{0,}c$' filename

is the same as

$ egrep '^ab*c$' filename

{1,} is equivalent to +

$ egrep '^ab{1,}c$' filename

is the same as

$ egrep '^ab+c$' filename
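
The equivalences can be checked on a few made-up strings. This sketch assumes grep -E (equivalent to egrep) and anchors every pattern at both ends so that the short and long forms are compared like for like.

```shell
# Made-up test strings: zero, one, and two b characters, plus one non-match.
sample='ac
abc
abbc
abd'

# ? and {0,1}: the b is optional.
optional_long=$(printf '%s\n' "$sample" | grep -E '^ab{0,1}c$')
optional_short=$(printf '%s\n' "$sample" | grep -E '^ab?c$')

# * and {0,}: any number of b characters, including none.
star_long=$(printf '%s\n' "$sample" | grep -E '^ab{0,}c$')
star_short=$(printf '%s\n' "$sample" | grep -E '^ab*c$')

# + and {1,}: at least one b.
plus_long=$(printf '%s\n' "$sample" | grep -E '^ab{1,}c$')
plus_short=$(printf '%s\n' "$sample" | grep -E '^ab+c$')

echo "? matches:" $optional_short
echo "* matches:" $star_short
echo "+ matches:" $plus_short
```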

Let us look at some examples involving the expressions we have seen so far. Here, instead of searching a file, we search standard input. This works because grep (or egrep) prints every line that contains a match for the pattern, so each input line that matches is echoed back, and lines that do not match produce no output.

We would like to match all the possible ways to spell the sentence "the grey colour suit was his favourite".

The expression would be:

$ egrep 'the gr[ea]y colou?r suit was his favou?rite'
the grey color suit was his favourite
the grey color suit was his favourite

the gray colour suit was his favorite
the gray colour suit was his favorite

Looking at the expression above, we can see that:

  • grey can be spelled as grey or gray, so we use gr[ea]y

  • colour can be written as colour or color, which means the u is optional, so we use u?

  • similarly, favourite or favorite can be written as favou?rite
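
Three independent two-way choices give eight possible spellings in total. As a sketch, the loop below generates all eight and checks each one against the pattern, then tries one deliberate typo. It assumes grep -E (equivalent to egrep); the -q option suppresses output so that only the exit status is used, and the variable names are illustrative.

```shell
pattern='the gr[ea]y colou?r suit was his favou?rite'
matched=0
total=0

# Generate every combination of the three spelling choices.
for grey in grey gray; do
  for colour in colour color; do
    for favourite in favourite favorite; do
      total=$((total + 1))
      if printf 'the %s %s suit was his %s\n' "$grey" "$colour" "$favourite" | grep -Eq "$pattern"; then
        matched=$((matched + 1))
      fi
    done
  done
done
echo "$matched of $total spellings matched"

# A spelling outside the pattern (groy) should be rejected.
if printf 'the groy colour suit was his favourite\n' | grep -Eq "$pattern"; then
  typo=matched
else
  typo=rejected
fi
echo "typo: $typo"
```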

How about matching a US zip code?

$ egrep '^[0-9]{5}(-[0-9]{4})?$'
83456
83456

83456-

834562

92456-1234
92456-1234

10344-2342-345
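
The same check can be scripted rather than typed interactively. This is a minimal sketch that runs the five inputs above through the pattern and labels each one. It assumes grep -E (equivalent to egrep), and the variable names are illustrative.

```shell
zip_pattern='^[0-9]{5}(-[0-9]{4})?$'
results=''

for candidate in 83456 83456- 834562 92456-1234 10344-2342-345; do
  # grep -q prints nothing; its exit status says whether the line matched.
  if printf '%s\n' "$candidate" | grep -Eq "$zip_pattern"; then
    verdict=valid
  else
    verdict=invalid
  fi
  echo "$candidate $verdict"
  results="$results$candidate:$verdict "
done
```

The ^ and $ anchors are what reject 834562 and 10344-2342-345, and the ? after the group is what makes the four-digit suffix optional as a whole.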

One more example: matching all valid times on a 24-hour clock. The $ at the end ensures that nothing may follow the seconds.

$ egrep '^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$'
23:44:02
23:44:02

33:45:11

15:45:33
15:45:33

In the above example, if the first digit of the hour is 0 or 1, the second digit can be anything from 0 to 9. But if the first digit is 2, the only allowed values for the second digit are 0, 1, 2, or 3.
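
To probe the edges of this pattern, the sketch below splits a made-up list of times into those that match and those that do not. It assumes grep -E (equivalent to egrep) and a pattern anchored with $ at the end, so that trailing text is rejected. The -v option inverts the match.

```shell
time_pattern='^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$'

# Made-up test inputs around the boundaries of the clock.
times='00:00:00
23:59:59
24:00:00
19:60:00
7:15:00
23:44:02pm'

# Lines that are valid 24-hour times.
valid=$(printf '%s\n' "$times" | grep -E "$time_pattern")

# -v selects the lines that do NOT match.
invalid=$(printf '%s\n' "$times" | grep -Ev "$time_pattern")

echo "valid:"
echo "$valid"
echo "invalid:"
echo "$invalid"
```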

Word Boundary

The metacharacters \< and \> match the start and the end of a word, respectively. Suppose we want a pattern that matches words ending with color, so that unicolor, watercolor, multicolor, etc. are matched but colorless and colorful are not. Try these examples yourself to get familiar with them:

$ egrep 'color\>'

Next, to match colorless and colorful, but not unicolor, watercolor, multicolor, etc.:

$ egrep '\<color'

Finally, to match only the exact word color, we use:

$ egrep '\<color\>'
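
The three patterns can be compared side by side on a short word list. This sketch assumes GNU grep, where \< and \> are supported as extensions, and uses grep -E in place of egrep. The word list and variable names are illustrative.

```shell
words='unicolor
watercolor
colorless
colorful
color'

# color at the end of a word.
ends_with=$(printf '%s\n' "$words" | grep -E 'color\>')

# color at the start of a word.
starts_with=$(printf '%s\n' "$words" | grep -E '\<color')

# color as a complete word.
exact=$(printf '%s\n' "$words" | grep -E '\<color\>')

echo "ends with color:" $ends_with
echo "starts with color:" $starts_with
echo "exactly color:" $exact
```

The bare word color satisfies all three patterns, because it both starts and ends a word.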

Backreferences

Suppose we want to match all words that were typed twice in a row, like the the or before before. For this we have to use backreferences. A backreference matches the same text that was matched earlier by a parenthesized group.

Here’s an example:

$ egrep '\<(the)\> \1\>'

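Or the generic way, which matches any doubled word. The group is limited to letters and the pattern ends with \>, so that the the matches but the theory does not:

$ egrep '\<([A-Za-z]+)\> \1\>'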

The above example can also be used to find all names in which the first and last names are the same. If there is more than one set of parentheses, the second, third, fourth, etc. groups can be referenced with \2, \3, \4, etc.
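
As a sketch of a doubled-word check, the following runs a letters-only variant of the generic pattern over a few made-up lines. It assumes GNU grep, where backreferences and \< \> work with -E as extensions; the -o option prints only the matched text instead of the whole line. The sample text and variable names are illustrative.

```shell
text='this is the the end
he came before before dawn
the theory holds
Kumar Kumar
nothing repeated here'

# ([A-Za-z]+) captures one whole word; \1 must repeat exactly that word,
# and the final \> stops "the the" from matching inside "the theory".
pattern='\<([A-Za-z]+)\> \1\>'

# Whole lines that contain a doubled word.
doubled_lines=$(printf '%s\n' "$text" | grep -E "$pattern")

# Only the doubled words themselves.
doubled_words=$(printf '%s\n' "$text" | grep -Eo "$pattern")

echo "$doubled_lines"
echo "$doubled_words"
```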

This is just an introduction to the power of regular expressions.
