Regular Expressions

From Psygen Wiki

A regular expression (or regex for short) is a standard text notation for describing a search pattern.

Just as you can type *.jpg in a search box to find all JPEG files, you can use a regular expression (with a tool such as grep) to match much more complex patterns.

For example, you could use:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b

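to search for e-mail addresses in a file. Because the pattern lists only uppercase letters, it needs a case-insensitive match to find lowercase addresses (for example, grep -E -i with GNU grep).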
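
As a rough sketch of how the same pattern might be used outside grep, the Python snippet below applies it with the standard re module. The find_emails helper and the sample text are made up for illustration and are not part of the original page.

```python
import re

# The pattern from above. It lists only A-Z, so re.IGNORECASE is passed
# to let it match lowercase addresses as well.
EMAIL_PATTERN = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE
)


def find_emails(text):
    """Return every substring of text that looks like an e-mail address."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical sample text, for illustration only.
sample = "Contact alice@example.com or Bob.Smith@mail.example.org for details."
print(find_emails(sample))  # ['alice@example.com', 'Bob.Smith@mail.example.org']
```

This pattern is a practical approximation rather than a full validator, so it can accept or reject some unusual addresses.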


Cheat Sheet

a - a literal character, such as the letter "a". Every character is literal except these twelve: \ ^ $ . | ? * + ( ) [ {

. (dot) - any single character (usually excluding a line break).

? - the preceding character matches 0 or 1 times only.

* - the preceding character matches 0 or more times.

+ - the preceding character matches 1 or more times.

{n} - the preceding character matches exactly n times.

{n,m} - the preceding character matches at least n times and not more than m times. Example: a{2,4} matches the character "a" at least twice, but not more than four times.

[agd] - the character is one of those included within the square brackets.

[^agd] - the character is not one of those included within the square brackets.

[c-f] - the dash within the square brackets operates as a range. In this case it means any one of the letters c, d, e, or f. You can also use digits to specify a range of numbers, such as [0-9].

() - allows us to group several characters to behave as one.

| (pipe symbol) - the logical OR operation.

^ - matches the beginning of the line.

$ - matches the end of the line.

\ - escapes a special character. For example, if you want to see whether a file has a question mark in it, you can't use the question mark symbol by itself because it has a special meaning. So we escape it by putting a backslash in front of it, which tells regex to ignore its special meaning and treat it as a literal character. Like this: \?
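
The cheat sheet entries above can be tried out with a short script. This is a minimal sketch using Python's re module; the sample strings and variable names are invented for illustration, and other regex tools may differ in small details.

```python
import re

# ? makes the preceding character optional.
optional = re.findall(r"colou?r", "color colour")       # ['color', 'colour']

# {n,m} bounds the repetition; fullmatch requires the whole string to match.
three_a = bool(re.fullmatch(r"a{2,4}", "aaa"))           # True
one_a = bool(re.fullmatch(r"a{2,4}", "a"))               # False

# [c-f] is a range, [^agd] is a negated set.
in_range = re.findall(r"[c-f]", "abcdefg")               # ['c', 'd', 'e', 'f']
not_agd = re.findall(r"[^agd]", "badge")                 # ['b', 'e']

# () groups characters so + applies to the whole group; | means OR.
grouped = bool(re.fullmatch(r"(ab)+", "ababab"))         # True
either = re.findall(r"cat|dog", "cat, dog, bird")        # ['cat', 'dog']

# \? matches a literal question mark.
literal_q = re.findall(r"\?", "Really? Yes.")            # ['?']

for result in (optional, three_a, one_a, in_range, not_agd,
               grouped, either, literal_q):
    print(result)
```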


Anchor Characters

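Line-oriented tools such as grep treat the end of each line as a separator, and regular expressions examine the text between separators. If you want to search for a pattern that is at one end of a line or the other, you use anchors. The character ^ is the starting anchor, and the character $ is the end anchor.
Note that ^ and $ are only anchors if they are used at the start (^) or end ($) of a pattern. Elsewhere, basic (grep-style) regular expressions treat them as literal characters.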

Examples:
Pattern  Matches
^A     "A" at the beginning of a line
A$     "A" at the end of a line
A^     "A^" anywhere on a line
$A     "$A" anywhere on a line
^^     "^" at the beginning of a line
$$     "$" at the end of a line
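
To see the first two rows of the table in action, here is a small sketch in Python. It assumes Python's re module, which differs from basic grep in one respect: ^ and $ keep their anchor meaning wherever they appear, so a literal caret has to be escaped as \^. The sample text is made up.

```python
import re

# Three hypothetical lines of text.
text = "Apple pie\nbananA\nA^ literal"

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not just at the start and end of the whole string.
starts_with_a = re.findall(r"^A.*", text, re.MULTILINE)
ends_with_a = re.findall(r".*A$", text, re.MULTILINE)

# In Python the caret must be escaped to be matched literally.
literal_caret = re.findall(r"A\^", text)

print(starts_with_a)   # ['Apple pie', 'A^ literal']
print(ends_with_a)     # ['bananA']
print(literal_caret)   # ['A^']
```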


Shorthand Characters

\s     will match whitespace characters (a space, tab, or line break)
\d     will match digits (0-9; you can also use [0-9])
\w     will match word characters (A-Z, a-z, 0-9, and _ (underscore))
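
A quick way to check what each shorthand matches is to run it against a sample line. The sketch below uses Python's re module with an invented input string; note that in Python 3 these shorthands also match non-ASCII digits and letters unless the re.ASCII flag is given.

```python
import re

# Hypothetical input: words, a tab, and a date.
line = "order_42 shipped\t2024-01-15"

digits = re.findall(r"\d+", line)   # runs of digits
words = re.findall(r"\w+", line)    # runs of word characters
fields = re.split(r"\s+", line)     # split on runs of whitespace

print(digits)   # ['42', '2024', '01', '15']
print(words)    # ['order_42', 'shipped', '2024', '01', '15']
print(fields)   # ['order_42', 'shipped', '2024-01-15']
```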

Word Boundaries

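These identify the boundaries associated with words. \< and \> are extensions supported by tools such as GNU grep and are not available in every regex flavor.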

\<     used for beginning of the word

\>     used for end of the word

\b     used for either beginning or end of the word
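
The difference between the boundary types can be sketched in Python. Python's re module does not support \< and \>, so this example approximates them with \b on one side of the word only; the sample text is invented.

```python
import re

text = "cat concat category cat."

# \bcat\b: "cat" as a whole word only.
whole_word = re.findall(r"\bcat\b", text)
# \bcat at the start of a word (comparable to \<cat in GNU grep).
word_start = re.findall(r"\bcat\w*", text)
# cat\b at the end of a word (comparable to cat\> in GNU grep).
word_end = re.findall(r"\w*cat\b", text)

print(whole_word)   # ['cat', 'cat']
print(word_start)   # ['cat', 'category', 'cat']
print(word_end)     # ['cat', 'concat', 'cat']
```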
