In 30 minutes you will understand what regular expressions are and know enough about them to use them in your own programs or web pages.
How to use this tutorial
Don’t be intimidated by the complex expressions below. Just follow me step by step and you’ll find that regular expressions are not as difficult as you might think. Of course, if you finish this tutorial feeling that you understood a lot but can remember almost nothing, that’s normal: I think there is zero chance that someone who has never touched regular expressions can remember more than 80% of the syntax mentioned here after one read. The aim is only to help you understand the basic principles; to become proficient you will need to practice and use regular expressions a lot.
In addition to being an introductory tutorial, this article is intended to serve as a reference manual for regular expression syntax that you can use in your daily work. In the author’s own experience, it does that job well. You see, even I haven’t managed to memorize everything, have I?
What the heck is a regular expression?
When writing programs or web pages that deal with strings, you often need to find strings that match certain complex rules. Regular expressions are the tool for describing these rules. In other words, a regular expression is a piece of code that records rules about text.
Most likely you have used the wildcard characters (* and ?) for file lookup under Windows/DOS. If you wanted to find all Word documents in a directory, you would search for *.doc, where * is interpreted as any string. Like wildcards, regular expressions are used to match text, but they describe what you need more precisely, at the cost of more complexity, of course. For example, you could write a regular expression to find all strings that start with 0, followed by 2 or 3 digits, followed by a hyphen “-”, and finally a string of 7 or 8 digits (like 010-12345678 or 0376-7654321).
Getting Started
The best way to learn regular expressions is to start with examples, understand them and then modify and experiment with them yourself. A number of simple examples are given below, with detailed explanations of them.
Suppose you are looking for hi in an English novel. You can use the regular expression hi.
This is about as simple as a regular expression gets: it matches exactly the two-character string whose first character is h and whose second is i. Tools that handle regular expressions usually provide an option to ignore case; if this option is checked, the expression matches any of the four variants hi, HI, Hi, and hI.
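To make the ignore-case option concrete, here is a minimal sketch in Python. Note that this tutorial describes .NET, so Python's `re` module is only a stand-in here, and the sample sentence is made up for illustration.

```python
import re

# Made-up sample text containing all four case variants.
text = "Hi there. She said hi, then shouted HI, and he typed hI back."

# Case-sensitive search: only the lowercase "hi" is found.
exact = re.findall(r"hi", text)

# re.IGNORECASE plays the role of the "ignore case" checkbox.
any_case = re.findall(r"hi", text, flags=re.IGNORECASE)

print(exact)     # ['hi']
print(any_case)  # ['Hi', 'hi', 'HI', 'hI']
```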
Unfortunately, many words contain the two consecutive characters hi, such as him, history, high, and so on. If you search with hi, the hi inside these words will be found too. To find exactly the word hi, we should use \bhi\b.
\b is a special code (well, some people call it a metacharacter) defined by the regular expression syntax. It represents the beginning or end of a word, that is, a word boundary. Although words in English are usually separated by spaces, punctuation, or newlines, \b does not match any of these separator characters; it matches only a position.
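The difference between hi and \bhi\b can be sketched like this in Python (again just a stand-in for the .NET behavior described here; the sample sentence is invented):

```python
import re

# Made-up sentence mixing the word "hi" with words that merely contain it.
text = "hi, him and history said hi to the high hill"

# Plain "hi" also hits him, history, high and hill.
loose = re.findall(r"hi", text)

# \bhi\b only accepts "hi" when it stands alone as a word.
strict = re.findall(r"\bhi\b", text)

# \b consumes no characters: the match is just "hi", without the
# comma or space next to it. Only the start positions differ.
starts = [m.start() for m in re.finditer(r"\bhi\b", text)]

print(len(loose))  # 6
print(strict)      # ['hi', 'hi']
print(starts)      # [0, 25]
```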
If you are looking for hi followed, not far behind, by Lucy, you would use \bhi\b.*\bLucy\b.
Here, . is another metacharacter, matching any character other than a newline. * is also a metacharacter, but it represents neither a character nor a position; it represents a quantity. It specifies that whatever precedes * may be repeated any number of times in a row for the whole expression to match. Thus .* together means any number of characters that are not newlines. Now the meaning of \bhi\b.*\bLucy\b is obvious: first the word hi, then any number of arbitrary characters (but no newline), and finally the word Lucy.
If we also use other metacharacters, we can construct more powerful regular expressions. Take the following example.
0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, then two digits, then a hyphen “-”, and finally 8 digits (that is, a phone number in China; of course, this example only matches the case where the area code has 3 digits).
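Here is a small sketch of that expression in Python (a stand-in for the .NET behavior this tutorial describes; both sample strings are invented). It also shows that . stops at a newline by default:

```python
import re

pattern = re.compile(r"\bhi\b.*\bLucy\b")

# Made-up samples: one keeps hi and Lucy on the same line, one does not.
same_line = "I said hi to my neighbour Lucy yesterday."
split_lines = "I said hi\nand later met Lucy."

m = pattern.search(same_line)
matched = m.group() if m else None

# Since . does not match a newline, .* cannot bridge the two lines.
no_match = pattern.search(split_lines)

print(matched)   # hi to my neighbour Lucy
print(no_match)  # None
```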
Here \d is a new metacharacter, matching one digit (0, or 1, or 2, and so on). - is not a metacharacter; it matches only itself, the hyphen (or minus sign, or dash, or whatever you want to call it).
To avoid so much annoying repetition, we can also write this expression as 0\d{2}-\d{8}. Here the {2} (or {8}) after \d means that the preceding \d must be repeated exactly 2 times (or 8 times) in a row.
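As a quick check that the long and short spellings behave the same, here is a hedged sketch in Python (not the .NET environment this tutorial targets). The sample numbers are made up, and `fullmatch` is my own choice so that the whole string has to fit the pattern:

```python
import re

long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

# Made-up samples: only the first has a 3-digit area code and 8 digits.
samples = ["010-12345678", "0376-7654321", "010-1234567", "110-12345678"]

# fullmatch() requires the entire string to match the pattern.
long_results = [bool(long_form.fullmatch(s)) for s in samples]
short_results = [bool(short_form.fullmatch(s)) for s in samples]

print(long_results)   # [True, False, False, False]
print(short_results)  # [True, False, False, False]
```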
Testing regular expressions
If you don’t find regular expressions hard to read and write, you’re either a genius or, well, you’re not from Earth. The syntax of regular expressions is a headache, even for those who use it regularly. Because it is hard to read and write and prone to errors, it is essential to find a tool to test regular expressions.
Some details of regular expressions differ between environments. This tutorial describes the behavior of regular expressions under Microsoft .Net Framework 4.0, so I recommend the .Net tool I wrote, Regular Expression Tester. Please refer to the instructions on that page to install and run the software.
Here is a screenshot of Regex Tester when it is running.
Now you know a few useful metacharacters, such as \b, ., *, and \d. Regular expressions have more of them: \s matches any whitespace character, including space, tab, newline, the Chinese full-width space, and so on; \w matches letters, digits, underscores, Chinese characters, and so on.
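A short sketch of \w and \s in Python follows. Python's `re` module is Unicode-aware for ordinary strings, which happens to line up with the .NET behavior described here; other engines may only handle ASCII. The sample string is invented, and `\u3000` is the full-width space.

```python
import re

# Made-up text: a tab, an ordinary space, a full-width space and a newline
# separate ASCII words, Chinese characters and a number.
text = "name_1\t价格 42\u3000ok\n"

# \w+ collects runs of letters, digits, underscores and Chinese characters.
words = re.findall(r"\w+", text)

# \s picks up each whitespace character one at a time.
blanks = re.findall(r"\s", text)

print(words)   # ['name_1', '价格', '42', 'ok']
print(blanks)  # ['\t', ' ', '\u3000', '\n']
```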
Here are some more examples.
\ba\w*\b matches words starting with the letter a: first the beginning of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).
\d+ matches 1 or more consecutive digits. The + here is a metacharacter similar to *; the difference is that * matches any number of repetitions (possibly 0), while + matches 1 or more repetitions.
\b\w{6}\b matches words with exactly 6 characters.
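These examples can be tried out with a few lines of Python (used here as a stand-in for .NET; the sample sentence is invented):

```python
import re

# Made-up sentence with short words, long words and numbers.
text = "an apple and a banana cost 120 yuan in August 2024"

# Words that start with a lowercase a (the search is case-sensitive).
a_words = re.findall(r"\ba\w*\b", text)

# Runs of one or more digits.
numbers = re.findall(r"\d+", text)

# Words that are exactly 6 characters long.
six_chars = re.findall(r"\b\w{6}\b", text)

print(a_words)    # ['an', 'apple', 'and', 'a']
print(numbers)    # ['120', '2024']
print(six_chars)  # ['banana', 'August']
```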
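Code | Description |
---|---|
. | Matches any character except a newline |
\w | Matches a letter, digit, underscore, or Chinese character |
\s | Matches any whitespace character |
\d | Matches a digit |
\b | Matches the beginning or end of a word |
^ | Matches the beginning of the string |
$ | Matches the end of the string |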
The metacharacters ^ (on the same key as the number 6) and $ both match a position, much like \b. ^ matches the beginning of the string you are searching, and $ matches its end. These two codes are very useful when validating input. For example, a website that requires you to fill in a QQ number of 5 to 12 digits could use ^\d{5,12}$.
The {5,12} here is similar to the {2} introduced earlier, except that {2} means exactly 2 repetitions, no more and no less, while {5,12} means no fewer than 5 and no more than 12 repetitions; otherwise there is no match.
Because ^ and $ are used, the entire input string must match \d{5,12}, which means the entire input must consist of 5 to 12 digits. So if the QQ number entered matches this regular expression, it meets the requirements.
Similar to the option to ignore case, some regular expression tools have an option to handle multiple lines. If this option is checked, ^ and $ match the beginning and end of each line instead.
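The validation idea can be sketched as follows in Python (a stand-in for .NET). The helper name `is_valid_qq` and the sample inputs are my own inventions for illustration. One detail to be aware of: in Python, $ also matches just before a single trailing newline, so the sketch strips nothing and assumes the input has none.

```python
import re

# Anchored pattern: the whole input must be 5 to 12 digits.
QQ_PATTERN = re.compile(r"^\d{5,12}$")

def is_valid_qq(value: str) -> bool:
    """Hypothetical helper: True if value looks like a 5-12 digit QQ number."""
    return QQ_PATTERN.search(value) is not None

# Made-up inputs: valid, too short, too long, and digits mixed with letters.
checks = [is_valid_qq(v) for v in ["12345", "1234", "1234567890123", "abc12345"]]

# Without ^ and $ the digits are found anywhere inside the input.
unanchored = re.search(r"\d{5,12}", "abc12345xyz") is not None

print(checks)      # [True, False, False, False]
print(unanchored)  # True
```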
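In Python the multiline option is spelled `re.MULTILINE` (in .NET the corresponding option is `RegexOptions.Multiline`). A minimal sketch with made-up input:

```python
import re

# Made-up input with four lines; two of them are 5 to 12 digits long.
text = "10001\nhello\n123456789\n42"

# Default mode: ^ and $ refer to the whole string, which is not all digits.
single = re.findall(r"^\d{5,12}$", text)

# Multiline mode: ^ and $ refer to the start and end of each line.
multi = re.findall(r"^\d{5,12}$", text, flags=re.MULTILINE)

print(single)  # []
print(multi)   # ['10001', '123456789']
```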