Regular expressions - Notes
What are regular expressions?
Regular expressions (or Regex) are patterns of text that you define to search documents and match exactly what you're looking for.
Why should I learn how to use them?
Even if you don't need them right away, they're a great tool to know how to use. They will make you more capable in CTFs, and potentially a better developer if that's a goal you have. You spend a little time learning them and save yourself lots of time in the long run by using them.
I know all that, but I'm lazy.
This is a lazy person's tutorial. There's a little reading, and then you learn by doing.
Where's the 'Deploy' button?
There's no machine to deploy. There are two ways to test your expressions. Either:
- create a text file with some test paragraphs (on a Unix machine) and then use egrep <pattern> <file> (or the equivalent grep -E <pattern> <file>) to see what matches and what doesn't, or
- use an online editor like https://regexr.com/. You can add your own text in the "Text" field, and then type your expressions (patterns) in the "Expression" field.
The wildcard that is used to match any single character (except the line break) is the . dot. That means that a.c will match aac, abc, a0c, a!c, and so on.
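If you'd rather poke at this from code, here is a minimal sketch using Python's built-in re module (Python is my choice here, not something the original tutorial uses). The candidate list is made up for illustration: the first four strings come from the text above, the last two are there to show what the dot refuses to match.

```python
import re

# First four strings are from the text; "ac" and "a\nc" are extra test cases.
candidates = ["aac", "abc", "a0c", "a!c", "ac", "a\nc"]

# fullmatch only succeeds when the WHOLE string fits the pattern a.c
dot_matches = [s for s in candidates if re.fullmatch(r"a.c", s)]

print(dot_matches)  # ['aac', 'abc', 'a0c', 'a!c']
```

"ac" fails because the dot needs exactly one character to consume, and "a\nc" fails because the dot skips line breaks by default.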
Also, you can set a character as optional in your pattern using the ? question mark. That means that abc? will match ab and abc, since the c is optional.
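A quick way to convince yourself, again sketched with Python's re module (an assumption on my part; any regex engine behaves the same for this pattern). The extra strings "abcc" and "a" are hypothetical test inputs, added to show where the pattern stops matching.

```python
import re

optional_c = re.compile(r"abc?")

# Map each test string to whether the whole string fits the pattern.
results = {s: optional_c.fullmatch(s) is not None for s in ["ab", "abc", "abcc", "a"]}

print(results)  # {'ab': True, 'abc': True, 'abcc': False, 'a': False}
```

The ? only applies to the character right before it, so the a and the b are still required.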
Note: If you want to search for a literal . dot, you have to escape it with a \ backslash. That means that a.c will match a.c, but also abc, a@c, and so on. But a\.c will match just a.c.
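The difference between the two patterns can be sketched like this in Python (my choice of language, not the tutorial's). The sample list reuses the strings from the note, and re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

samples = ["a.c", "abc", "a@c"]

# Unescaped dot: a wildcard, so every sample matches.
unescaped = [s for s in samples if re.fullmatch(r"a.c", s)]

# Escaped dot: only a literal "." is accepted.
escaped = [s for s in samples if re.fullmatch(r"a\.c", s)]

print(unescaped)         # ['a.c', 'abc', 'a@c']
print(escaped)           # ['a.c']
print(re.escape("a.c"))  # a\.c
```

Note the r"..." raw strings: they keep Python itself from interpreting the backslash before the regex engine sees it.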
There are easier ways to match bigger charsets. For example, \d is used to match any single digit. Here's a reference:
\d matches a digit, like 9
\D matches a non-digit, like A or @
\w matches an alphanumeric character, like a or 3
\W matches a non-alphanumeric character, like ! or #
\s matches a whitespace character (spaces, tabs, and line breaks)
\S matches everything else (alphanumeric characters and symbols)
Note: Underscores _ are included in the \w metacharacter and not in \W. That means that \w will match every single character in test_file.
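Here is a small sketch of those metacharacters in action, using Python's re module (an assumption; the tutorial itself only uses egrep and regexr). The sample string is made up. One caveat: in Python 3, \d and \w also accept non-ASCII digits and letters unless you pass re.ASCII, so results on exotic text can differ from other tools.

```python
import re

text = "test_file 42!"  # hypothetical sample text

digits = re.findall(r"\d", text)      # every digit
non_word = re.findall(r"\W", text)    # everything \w rejects
whitespace = re.findall(r"\s", text)  # whitespace only

# The underscore counts as a \w character, so all 9 characters match.
word_chars = re.findall(r"\w", "test_file")

print(digits)           # ['4', '2']
print(non_word)         # [' ', '!']
print(whitespace)       # [' ']
print(len(word_chars))  # 9
```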
Often we want a pattern that matches many characters of a single type in a row, and we can do that with repetitions. For example, {2} is used to match the preceding character (or metacharacter, or charset) two times in a row. That means that z{2} will match exactly zz.
Here's a reference for each repetition along with how many times it matches the preceding pattern:
{12} - exactly 12 times.
{1,5} - 1 to 5 times.
{2,} - 2 or more times.
* - 0 or more times.
+ - 1 or more times.
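The repetition operators above can be tried out with a tiny helper. This is a sketch in Python (not part of the original tutorial); fits is a hypothetical helper name, and the test strings are invented for illustration.

```python
import re

def fits(pattern: str, text: str) -> bool:
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

cases = [
    (r"z{2}", "zz"),         # exactly two z's
    (r"z{2}", "zzz"),        # one z too many
    (r"\d{1,5}", "12345"),   # five digits is within 1 to 5
    (r"\d{1,5}", "123456"),  # six digits is not
    (r"a{2,}", "a"),         # needs at least two a's
    (r"a{2,}", "aaaa"),
    (r"ab*", "a"),           # * allows zero b's
    (r"ab+", "a"),           # + needs at least one b
    (r"ab+", "abbb"),
]

results = [fits(pattern, text) for pattern, text in cases]

for (pattern, text), ok in zip(cases, results):
    print(f"{pattern!r} vs {text!r}: {ok}")
```

Each quantifier applies only to the item directly before it, which is why ab* accepts a plain a.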
Starts with/ends with, groups, and either/or
Sometimes it's very useful to specify that we want to search for a certain pattern at the beginning or the end of a line. We do that with these characters:
^ - starts with
$ - ends with
So for example, if you want to search for a line that starts with abc, you can use ^abc.
If you want to search for a line that ends with xyz, you can use xyz$.
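To mimic what egrep does (test the pattern against each line in turn), here is a minimal Python sketch. Python and the three sample lines are my own additions for illustration.

```python
import re

# Hypothetical test lines, standing in for the lines of a text file.
lines = [
    "abc is at the start",
    "the end is xyz",
    "xyz then abc in the wrong places",
]

# re.search looks anywhere in the line, so the anchors do the real work.
starts_with_abc = [line for line in lines if re.search(r"^abc", line)]
ends_with_xyz = [line for line in lines if re.search(r"xyz$", line)]

print(starts_with_abc)  # ['abc is at the start']
print(ends_with_xyz)    # ['the end is xyz']
```

The third line contains both abc and xyz, but neither sits at the position its anchor demands, so it is filtered out both times.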
Note: The ^ hat symbol (caret) is used to exclude a charset when it is the first character inside [square brackets], but when it is outside them, it is used to specify the beginning of a line.
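The two meanings of ^ are easy to mix up, so here is a short side-by-side sketch in Python (my choice of language). The sample text and the [a-z] charset are assumptions made for the example.

```python
import re

text = "cat hat bat"  # hypothetical sample

# Outside brackets: ^ pins the match to the start of the line.
at_start = re.findall(r"^[a-z]at", text)

# Inside brackets: [^c] means "any single character except c".
not_c = re.findall(r"[^c]at", text)

print(at_start)  # ['cat']
print(not_c)     # ['hat', 'bat']
```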
You can also define groups by enclosing a pattern in (parentheses). Groups can be used in many ways that are beyond the scope of this tutorial. We will use them to define an either/or pattern, and also to repeat patterns. To say "or" in Regex, we use the | pipe.
For an "either/or" pattern example, the pattern during the (day|night) will match both of these sentences: during the day and during the night.
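A sketch of the either/or pattern in Python (not the tutorial's own tooling), with one extra made-up sentence that should not match. As a bonus, the group remembers which alternative was picked.

```python
import re

pattern = re.compile(r"during the (day|night)")

# "during the evening" is a hypothetical non-matching example.
sentences = ["during the day", "during the night", "during the evening"]
matched = [s for s in sentences if pattern.fullmatch(s)]

# group(1) returns the text captured by the first pair of parentheses.
picked = pattern.fullmatch("during the night").group(1)

print(matched)  # ['during the day', 'during the night']
print(picked)   # night
```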
For a repetition example, the pattern (no){5} will match the sentence nonononono.
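It's worth seeing why the parentheses matter here. In this Python sketch (Python being my assumption, not the tutorial's), the same quantifier is applied with and without a group.

```python
import re

# With the group, {5} repeats the whole "no".
grouped = re.fullmatch(r"(no){5}", "nonononono") is not None

# Without it, {5} repeats only the "o".
ungrouped_on_nono = re.fullmatch(r"no{5}", "nonononono") is not None
ungrouped_on_nooooo = re.fullmatch(r"no{5}", "nooooo") is not None

print(grouped)              # True
print(ungrouped_on_nono)    # False
print(ungrouped_on_nooooo)  # True
```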
Conclusion
Well done.
Regular expressions are very powerful, even at their most basic usage. There are many resources to study and practise online as well, which I strongly recommend.
Also, if you're planning on using regex to develop something and you want to search for something like an e-mail address, you should look for premade expressions instead of writing your own.
With regex, you have to be specific, but not too specific: an overly specific pattern tends to turn into a complicated solution when a simpler, more elegant one would do.