873 lines
22 KiB
HTML
873 lines
22 KiB
HTML
<!DOCTYPE html>
|
|
<html>
|
|
<head>
|
|
<title>Regular Expressions Tutorial</title>
|
|
</head>
|
|
|
|
<body bgcolor=white>
|
|
|
|
<table width=100% border=0 summary="">
|
|
<tr>
|
|
<td align="center">
|
|
<h4>Found at: http://publish.ez.no/article/articleprint/11/</h4>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<table width=100% border=0 summary="">
|
|
<tr>
|
|
<td>
|
|
<h1>Regular Expressions Explained</h1>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<hr noshade="noshade" size=4>
|
|
|
|
<br>
|
|
<table width=100% border=0 summary="">
|
|
<tr>
|
|
<td>
|
|
<p class="byline">Author: Jan Borsodi</p>
|
|
</td>
|
|
<td align="right">
|
|
<p class="byline">Publishing date: 30.10.2000 18:02</p>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
This article will give you an introduction to the world of <i>regular
|
|
expressions</i>. I'll start off with explaining what regular expressions are
|
|
and introduce their syntax, then some examples with varying complexity.
|
|
|
|
<p>
|
|
<h2>Concept</h2>
|
|
A <i>regular expression</i> is a text pattern consisting of a combination of
|
|
alphanumeric characters and special characters known as metacharacters. A
|
|
close relative is in fact the <i>wildcard expression</i> which are often used
|
|
in file management. The pattern is used to match against text strings. The
|
|
result of a match is either successful or not, however when a match is
|
|
successful not all of the pattern must match, this is explained later in the
|
|
article.
|
|
|
|
<p>
|
|
You'll find that <i>regular expressions</i> are used in three different ways:
|
|
Regular text match, search and replace and splitting. The latter is basically
|
|
the same as the reverse match <i>i.e.</i> everything the <i>regular
|
|
expression</i> did not match.
|
|
|
|
<p>
|
|
<i>Regular expressions</i> are often simply called regexps or RE, but for
|
|
consistency I'll be referring to it with its full name.
|
|
|
|
<p>
|
|
Due to the versatility of the <i>regular expression</i> it is widely used in
|
|
text processing and parsing. UNIX users are probably familiar with them through
|
|
the use of the programs, <i>grep</i>, <i>sed</i>, <i>awk</i> and <i>ed</i>.
|
|
Text editors such as <i>(X)Emacs</i> and <i>vi</i> also use them heavily.
|
|
Probably the most known use of <i>regular expressions</i> are in the
|
|
programming language Perl; you'll find that Perl sports the most advanced
|
|
<i>regular expression</i> implementation to this day.
|
|
|
|
<p>
|
|
<h2>Usage</h2>
|
|
Now you're probably wondering why you should bother to learn <i>regular
|
|
expressions</i>. Well if you're a normal computer user your benefits from
|
|
using them are somewhat small, however if you're either a developer or a system
|
|
administrator you'll find that knowing <i>regular expressions</i> will make
|
|
your (professional) life so much better.
|
|
|
|
<p>
|
|
Developers can use them to parse text files, fix up code and other wonders.
|
|
System administrators can use them to search through logs, automate boring
|
|
tasks and sniff the network traffic for unauthorized activity.
|
|
|
|
<p>
|
|
Actually I would go so far as to say it's a crime for a System Administrator
|
|
not to have <b>any</b> knowledge of <i>regular expressions</i>.
|
|
|
|
<p>
|
|
<h2>Quantifiers</h2>
|
|
The contents of an expression are, as explained earlier, a combination of
|
|
alphanumeric characters and metacharacters. An alphanumeric character is either
|
|
a letter from the alphabet:
|
|
<p>
|
|
<table width=100% border=0 summary="Alphabetic Quantifiers">
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>abc</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
or a number:
|
|
<p>
|
|
<table width=100% border=0 summary="Numeric Quantifiers">
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>123</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Actually in the world of regular expressions any character which is not a
|
|
metacharacter will match itself (often called literal characters), however a
|
|
lot of the time you're mostly concerned with the alphanumeric characters. A
|
|
very special character is the backslash <b>\</b>, as this turns any
|
|
metacharacters into literal characters, and alphanumeric characters into a
|
|
sort of metacharacter or sequence. The metacharacters are:
|
|
<p>
|
|
<table width=100% border=0 summary="Metacharacters">
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>\ | ( ) [ { ^ $ * + ? . < ></code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
With that said normal characters don't sound too interesting so let's jump to
|
|
our very first metacharacters.
|
|
|
|
<p>
|
|
The punctuation mark, or dot, <b>.</b> needs explaining first since it often
|
|
leads to confusion. This character will not, as many might think, match the
|
|
punctuation in a line. It is instead a special metacharacter which matches any
|
|
character. Using this where you wanted to find the end of the line, or the
|
|
decimal in a floating number, will lead to strange results. As explained above,
|
|
you need to backslashify it to get the literal meaning. For instance take this
|
|
expression:
|
|
<p>
|
|
<table width=100% border=0 summary="Punctuation Marks">
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>1.23</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
will match the number 1.23 in a text as you might have guessed, but it will
|
|
<b>also</b> match these next lines:
|
|
<p>
|
|
<table width=100% border=0 summary="Punction Marks">
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>1x23</code><br>
|
|
<code>1 23</code><br>
|
|
<code>1-23</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
to make the expression <b>only</b> match the floating number we change it to:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>1\.23</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Remember this, it's very important. Now with that said we can get the show
|
|
going.
|
|
|
|
<p>
|
|
Two heavily recurring metacharacters are:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>*</code> and <code>+</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
They are called quantifiers and tell the engine to look for several occurrences
|
|
of a character; the quantifier always precedes the character at hand. The
|
|
<code><b>*</b></code> character matches <b>zero or more</b> occurrences of the
|
|
character in a row, the <code><b>+</b></code> character is similar but matches
|
|
<b><i>one</i> or more</b>.
|
|
|
|
<p>
|
|
So if you decided to find words which had the character <code><b>c</b></code>
|
|
in it you might be tempted to write:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>c*</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
What might come as a surprise to you is that you will find an enormous amount
|
|
of matches, even words with no <code><b>c</b></code> in them will match.
|
|
"How is this so?" you ask. Well, the answer is simple. Recall that the
|
|
<code><b>*</b></code> character matches <b>zero</b> or more characters, and
|
|
that's exactly what you matched, zero characters.
|
|
|
|
<p>
|
|
You see, in <i>regular expressions</i> you have the possibility to match what
|
|
is called the <b>empty string</b>, which is simply a string with zero size.
|
|
This empty string can actually be found in all texts. For instance the word:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>go</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
contains three empty strings. They are contained at the position right before
|
|
the <code><b>g</b></code>, in between the <code><b>g</b></code> and the
|
|
<code><b>o</b></code> and after the <code><b>o</b></code>. And an empty string
|
|
contains exactly <b>one</b> empty string. At first this might seem like a
|
|
really silly thing to do, but you'll learn later on how this is used in more
|
|
complex expressions.
|
|
|
|
<p>
|
|
So with this knowledge we might want to change our expression to:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>c+</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
and voila we get only words with <code><b>c</b></code> in them.
|
|
|
|
<p>
|
|
The next metacharacter you'll learn is:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>?</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
This simply tells the engine to either match the character or not (zero or
|
|
one). For instance the expression:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>cows?</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
will match any of these lines:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>cow</code><br>
|
|
<code>cows</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
These three metacharacters are simply a specialized scenario for the more
|
|
generalized quantifier:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>{n,m}</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
the <code><b>n</b></code> and <code><b>m</b></code> are respectively the
|
|
minimum and maximum size for the quantifier. For instance:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>{1,5}</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
means match one or up to five characters. You can also skip
|
|
<code><b>m</b></code> to allow for infinite match:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>{1,}</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
which matches one or more characters. This is exactly what the
|
|
<code><b>+</b></code> character does. So now you see the connection,
|
|
<code><b>*</b></code> is equal to <code><b>{0,}</b></code>,
|
|
<code><b>+</b></code> is equal to <code><b>{1,}</b></code> and
|
|
<code><b>?</b></code> is equal to <code><b>{0,1}</b></code>.
|
|
|
|
<p>
|
|
The last thing you can do with the quantifier is to also skip the comma:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>{5}</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
which means to match 5 characters, no more, no less.
|
|
|
|
<p>
|
|
<h2>Assertions</h2>
|
|
The next type of metacharacters is assertions. These will match if a given
|
|
assertion is true. The first pair of assertions are:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>^</code> and <code>$</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
which match the beginning of the line, and the end of the line, respectively.
|
|
Note that some <i>regular expression</i> implementations allows you to change
|
|
their behavior so that they will instead match the beginning of the text and
|
|
the end of the text. These assertions always match a zero length string, or in
|
|
other words, they match a position. For instance, if you wrote this expression:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>^The</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
it would match any line which began with the word <code><b>The</b></code>.
|
|
|
|
<p>
|
|
The next assertion characters match at the beginning and end of a word; they
|
|
are:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code><</code> and <code>></code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
they come in handy when you want to match a word precisely. For instance:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>cow</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
would match any of the following words:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>cow</code><br>
|
|
<code>coward</code><br>
|
|
<code>cowage</code><br>
|
|
<code>cowboy</code><br>
|
|
<code>cowl</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
A small change to the expression:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code><cow></code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
and you'll only match the word <code><b>cow</b></code> in the text.
|
|
|
|
<p>
|
|
One last thing to be said is that all literal characters are in fact assertions
|
|
themselves. The difference between them and the ones above is that literal
|
|
characters have a size. So for cleanliness sake we only use the word
|
|
"assertions" for those that are zero-width.
|
|
|
|
<p>
|
|
<h2>Groups and Alternation</h2>
|
|
One thing you might have noticed when we explained quantifiers is that they
|
|
only worked on the character to the left. Since this pretty much limits our
|
|
expressions I'll explain other uses for quantifiers. Quantifiers can also be
|
|
used on metacharacters; using them on assertions is silly since assertions are
|
|
zero-width and matching one, two, three or more of them doesn't do any good.
|
|
However the grouping and sequence metacharacters are perfect for being
|
|
quantified. Let's first start with grouping.
|
|
|
|
<p>
|
|
You can form groups, or subexpressions as they are frequently called, by using
|
|
the begin and end parenthesis characters:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>(</code> and <code>)</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
The <code><b>(</b></code> starts the subexpression and the <code><b>)</b></code>
|
|
ends it. It is also possible to have one or more subexpressions inside a
|
|
subexpressions. The subexpression will match if the contents match. So mixing
|
|
this with quantifiers and assertions you can do:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>( ?ho)+</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
which matches all of the following lines:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>ho</code><br>
|
|
<code>ho ho</code><br>
|
|
<code>ho ho ho</code><br>
|
|
<code>hohoho</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Another use for subexpressions are to extract a portion of the match if it
|
|
matches. This is often used in conjunction with sequences, which are discussed
|
|
later.
|
|
|
|
<p>
|
|
You can also use the result of a subexpression for what is called a back
|
|
reference. A back reference is given by using a backslashified digit, a single
|
|
non-zero digit. This leaves you with nine back references (0 through 9). The
|
|
back reference matches whatever the corresponding subexpression actually
|
|
matched (except that <code>{<i>article_contents_1</i>}</code> matches a null
|
|
character). To find the number of the subexpression, count the open parentheses
|
|
from the left.
|
|
|
|
<p>
|
|
The uses for back references are somewhat limited, especially since you only
|
|
have nine of them, but on some rare occasion you might need it. Note some
|
|
<i>regular expression</i> implementations can use multi-digit numbers as long
|
|
as they don't start with a 0.
|
|
|
|
<p>
|
|
Next are alternations, which allow you to match on any of many words. The
|
|
alternation character is:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>|</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
A sample usage is:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>Bill|Linus|Steve|Larry</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
would match either Bill, Linus, Steve or Larry. Mixing this with subexpressions
|
|
and quantifiers we can do:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>cow(ard|age|boy|l)?</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
which matches any of the following words but no others:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>cow</code><br>
|
|
<code>coward</code><br>
|
|
<code>cowage</code><br>
|
|
<code>cowboy</code><br>
|
|
<code>cowl</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
I mentioned earlier in the article that not all of the expression must match
|
|
for the match to be successful. This can happen when you're using
|
|
subexpressions together with alternations. For example:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>((Donald|Dolly) Duck)|(Scrooge McDuck)</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
As you see only the left or right top subexpression will match, not both. This
|
|
is sometimes handy when you want to run a complex pattern in one subexpression
|
|
and if it fails try another one.
|
|
|
|
<p>
|
|
<h2>Sequences</h2>
|
|
Last we have sequences, which define sequences of characters which can match.
|
|
Sometimes you don't want to match a word directly but rather something that
|
|
resembles one. The sequence characters are:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>[</code> and <code>]</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Any characters put inside the sequence brackets are treated as literal
|
|
characters, even metacharacters. The only special characters are
|
|
<code><b>-</b></code> which denotes character ranges, and <code><b>^</b></code>
|
|
which is used to negate a sequence. The sequence is somewhat similar to
|
|
alternation; the similarity is that only one of the items listed will match.
|
|
For instance:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>[a-z]</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
will match any lowercase characters which are in the English alphabet (a to z).
|
|
Another common sequence is:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>[a-zA-Z0-9]</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Which matches any lowercase or capital characters in the English alphabet as
|
|
well as numbers. Sequences are also mixed with quantifiers and assertions to
|
|
produce more elaborate searches. Example:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code><[a-zA-Z]+></code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
matches all whole words. This will match:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>cow</code><br>
|
|
<code>Linus</code><br>
|
|
<code>regular</code><br>
|
|
<code>expression</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
but will not match:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>200</code><br>
|
|
<code>x-files</code><br>
|
|
<code>C++</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Now if you wanted to find anything <i>but</i> words, the expression:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>[^a-zA-Z0-9]+</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
would find any sequences of characters which do not contain the English
|
|
alphabet or any numbers.
|
|
|
|
<p>
|
|
Some implementations of <i>regular expressions</i> allow you to use shorthand
|
|
versions for commonly used sequences, they are:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>\d</code>, a digit (<code>[0-9]</code>)<br>
|
|
<code>\D</code>, a non-digit (<code>[^0-9]</code>)<br>
|
|
<code>\w</code>, a word (alphanumeric) (<code>[a-zA-Z0-9]</code>)<br>
|
|
<code>\W</code>, a non-word (<code>[^a-zA-Z0-9]</code>)<br>
|
|
<code>\s</code>, a whitespace (<code>[ \t\n\r\f]</code>)<br>
|
|
<code>\S</code>, a non-whitespace (<code>[^ \t\n\r\f]</code>)
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
<b>Wildcards</b><br>
|
|
For people who have some knowledge with wildcards (also known as
|
|
<i>file globs</i> or <i>file globbing</i>), I'll give a brief explanation on how
|
|
to convert them to <i>regular expressions</i>. After reading this article, you
|
|
probably have seen the similarities with wildcards. For instance:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>*.jpg</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
matches any text which end with <code><b>.jpg</b></code>. You can also specify
|
|
brackets with characters, as in:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>*.[ch]pp</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
matches any text which ends in <code><b>.cpp</b></code> or
|
|
<code><b>.hpp</b></code>. Altogether very similar to regular expressions.
|
|
|
|
<p>
|
|
The <code><b>*</b></code> means match zero or more of anything in wildcards. As
|
|
we learned, we do this is regular expression with the punctuation mark and the
|
|
<code><b>*</b></code> quantifier. This gives:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>.*</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
Also remember to convert any punctuation marks from wildcards to be
|
|
backslashified.
|
|
|
|
<p>
|
|
The <code><b>?</b></code> means match any character but do match
|
|
<b>something</b>. This is exactly what the punctuation mark does.
|
|
|
|
<p>
|
|
Square brackets can be used untouched since they have the same meaning going
|
|
from wildcards to regular expressions.
|
|
|
|
<p>
|
|
These leaves us with:
|
|
<ul>
|
|
<li>Replace any <code><b>*</b></code> characters with <code><b>.*</b></code>
|
|
<li>Replace any <code><b>?</b></code> characters with <code><b>.</b></code>
|
|
<li>Leave square brackets as they are.
|
|
<li>Replace any characters which are metacharacters with a backslashified
|
|
version.
|
|
</ul>
|
|
|
|
<p>
|
|
<b>Examples</b><br>
|
|
<p>
|
|
<table width=100% border="0">
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>*.jpg</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
would be converted to:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>.*\.jpg</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>ez*.[ch]pp</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
would convert to:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>ez.*\.[ch]pp</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
or alternatively:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>ez.*\.(cpp|hpp)</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
<h3>Example Regular Expressions</h3>
|
|
To really get to know <i>regular expressions</i> I've left some commonly used
|
|
expressions on this page. Study them, experiment and try to understand exactly
|
|
what they are doing.
|
|
|
|
<p>
|
|
Email validity: will only match email addresses which are valid, such as
|
|
"user@host.com":
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Email validity #2: matches email addresses with a name in front, like "John
|
|
Doe <user@host.com>":
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>("?[a-zA-Z]+"?[ \t]*)+\<[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+\></code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Protocol validity: matches web based protocols such as "htpp://",
|
|
"ftp://" or "https://":
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>[a-z]+://</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
C/C++ includes: matches valid include statements in C/C++ files:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>^#include[ \t]+[<"][^>"]+[">]</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
C++ end of line comments:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>//.+$</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
C/C++ span line comments (it has one flaw, can you spot it?):
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>/\*[^*]*\*/</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Floating point numbers: matches simple floating point numbers of the kind 1.2
|
|
and 0.5:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>-?[0-9]+\.[0-9]+</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
Hexadecimal numbers: matches C/C++ style hex numbers, <i>e.g.</i>
|
|
<code>0xcafebabe</code>:
|
|
<p>
|
|
<table width=100% border=0>
|
|
<tr>
|
|
<td bgcolor="#f0f0f0">
|
|
<code>0x[0-9a-fA-F]+</code>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
</body>
|
|
</html>
|