Character classes
Consider a practical task – we have a phone number like "+7(903)-123-45-67", and we need to turn it into pure numbers: 79031234567.
To do so, we can find and remove anything that’s not a digit. Character classes can help with that.
A character class is a special notation that matches any symbol from a certain set.
To start, let’s explore the “digit” class. It’s written as pattern:\d and corresponds to “any single digit”.
For instance, let’s find the first digit in the phone number:
let str = "+7(903)-123-45-67";
alert( str.match(/\d/) ); // 7
Without the flag pattern:g, the regular expression only looks for the first match, that is the first digit pattern:\d.
Let’s add the pattern:g flag to find all digits:
let str = "+7(903)-123-45-67";
alert( str.match(/\d/g) ); // 7,9,0,3,1,2,3,4,5,6,7
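To make the effect of the flag concrete, here is a minimal sketch comparing the two results. It uses `console.log` instead of `alert` so it also runs outside a browser. The variable names `first` and `all` are only illustrative.

```javascript
const str = "+7(903)-123-45-67";

// Without g: a single match with extra details, such as its position.
const first = str.match(/\d/);
console.log(first[0], first.index); // 7 1

// With g: a plain array of all matched substrings.
const all = str.match(/\d/g);
console.log(all.length); // 11
```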
That was a character class for digits. There are other character classes as well.
The most used are:
pattern:\d (“d” is from “digit”)
- A digit: a character from 0 to 9.
pattern:\s (“s” is from “space”)
- A space symbol: includes spaces, tabs \t, newlines \n and a few other rare characters, such as \v, \f and \r.
pattern:\w (“w” is from “word”)
- A “wordly” character: either a letter of the Latin alphabet, a digit, or an underscore _. Non-Latin letters (like Cyrillic or Hindi) do not belong to pattern:\w.
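The definitions above can be checked character by character. This is a small sketch; the sample characters are chosen only for illustration, and `console.log` is used so it runs outside a browser as well.

```javascript
// Sample characters: a digit, a space, a tab, an underscore,
// a Latin letter, a Cyrillic letter and a hyphen.
const samples = { "7": "7", "space": " ", "tab": "\t", "_": "_", "Z": "Z", "я": "я", "-": "-" };

for (const [name, ch] of Object.entries(samples)) {
  console.log(name, {
    digit: /\d/.test(ch), // true only for "7"
    space: /\s/.test(ch), // true for the space and the tab
    word: /\w/.test(ch),  // true for "7", "_" and "Z", but not for "я"
  });
}
```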
For instance, pattern:\d\s\w means a “digit” followed by a “space character” followed by a “wordly character”, such as match:1 a.
A regexp may contain both regular symbols and character classes.
For instance, pattern:CSS\d matches a string match:CSS with a digit after it:
let str = "Is there CSS4?";
alert( str.match(/CSS\d/) ); // CSS4
We can also use several character classes:
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5'
In the match, each regexp character class corresponds to one character of the result: pattern:\s to the space, each pattern:\w to a letter of match:HTML, and pattern:\d to match:5.
Inverse classes
For every character class there exists an “inverse class”, denoted with the same letter, but uppercased.
The “inverse” means that it matches all other characters, for instance:
pattern:\D
- Non-digit: any character except pattern:\d, for instance a letter.
pattern:\S
- Non-space: any character except pattern:\s, for instance a letter.
pattern:\W
- Non-wordly character: anything but pattern:\w, e.g. a non-Latin letter or a space.
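As a quick illustration of the three inverse classes, the sketch below runs each of them over one short sample string (the string itself is arbitrary):

```javascript
const sample = "a1 _!";

console.log(sample.match(/\D/g)); // [ 'a', ' ', '_', '!' ]  everything but the digit
console.log(sample.match(/\S/g)); // [ 'a', '1', '_', '!' ]  everything but the space
console.log(sample.match(/\W/g)); // [ ' ', '!' ]            neither letter, digit nor underscore
```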
At the beginning of the chapter we saw how to make a number-only phone number from a string like subject:+7(903)-123-45-67: find all digits and join them.
let str = "+7(903)-123-45-67";
alert( str.match(/\d/g).join('') ); // 79031234567
An alternative, shorter way is to find non-digits pattern:\D and remove them from the string:
let str = "+7(903)-123-45-67";
alert( str.replace(/\D/g, "") ); // 79031234567
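The two approaches differ in one practical detail: with the pattern:g flag, `match` returns `null` when nothing is found, while `replace` always returns a string. A minimal sketch, where the helper names `digitsByMatch` and `digitsByReplace` are hypothetical:

```javascript
// Hypothetical helper: collect the digits, then join them.
function digitsByMatch(str) {
  // match returns null if there are no digits, so fall back to an empty array.
  return (str.match(/\d/g) || []).join("");
}

// Hypothetical helper: remove everything that is not a digit.
function digitsByReplace(str) {
  // replace needs no null check: it always returns a string.
  return str.replace(/\D/g, "");
}

console.log(digitsByMatch("+7(903)-123-45-67"));   // 79031234567
console.log(digitsByReplace("+7(903)-123-45-67")); // 79031234567
console.log(digitsByMatch("n/a") === "");          // true
console.log(digitsByReplace("n/a") === "");        // true
```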
A dot is “any character”
A dot pattern:. is a special character class that matches “any character except a newline”.
For instance:
alert( "Z".match(/./) ); // Z
Or in the middle of a regexp:
let regexp = /CS.4/;
alert( "CS-4".match(regexp) ); // CS-4
Please note that a dot means “any character”, but not the “absence of a character”. There must be a character to match it:
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
Dot as literally any character with “s” flag
By default, a dot doesn’t match the newline character \n.
For instance, the regexp pattern:A.B matches match:A, and then match:B with any character between them, except a newline \n:
alert( "A\nB".match(/A.B/) ); // null (no match)
There are many situations when we’d like a dot to mean literally “any character”, newline included.
That’s what flag pattern:s does. If a regexp has it, then a dot pattern:. matches literally any character:
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
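Strictly speaking, the newline \n is not the only character the dot skips by default: the language treats \r, \u2028 and \u2029 as line terminators too. The sketch below checks each of them with and without the flag. It assumes an engine that supports the pattern:s flag.

```javascript
// Line terminators, keyed by a printable name.
const terminators = { "\\n": "\n", "\\r": "\r", "\\u2028": "\u2028", "\\u2029": "\u2029" };

for (const [name, ch] of Object.entries(terminators)) {
  const text = "A" + ch + "B";
  // Each line prints: name false true
  console.log(name, /A.B/.test(text), /A.B/s.test(text));
}
```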
Check <https://caniuse.com/#search=dotall> for the most recent state of support. At the time of writing it doesn't include Firefox, IE, Edge.
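Where the pattern:s flag is unavailable, a commonly used workaround is to replace the dot with a set such as pattern:[\s\S], meaning “a space character or a non-space character”, that is, anything. Sets in square brackets are not covered in this chapter, so treat the following as a sketch only:

```javascript
// The plain dot fails on a newline...
console.log(/A.B/.test("A\nB"));      // false

// ...but "space or non-space" covers every character, newline included.
console.log(/A[\s\S]B/.test("A\nB")); // true
console.log(/A[\s\S]B/.test("A-B"));  // true
```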
Usually we pay little attention to spaces. For us, the strings `subject:1-5` and `subject:1 - 5` are nearly identical. In a regexp, though, a space is a character like any other and has to be matched explicitly.
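A short sketch of how this plays out when searching for digits separated by a hyphen:

```javascript
// The pattern has no spaces, the string does: no match.
console.log("1 - 5".match(/\d-\d/));        // null

// Spaces written into the pattern make it match.
console.log("1 - 5".match(/\d - \d/)[0]);   // 1 - 5

// \s works as well, and also accepts tabs and other space symbols.
console.log("1 - 5".match(/\d\s-\s\d/)[0]); // 1 - 5
```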
Summary
There exist the following character classes:
- pattern:\d – digits.
- pattern:\D – non-digits.
- pattern:\s – space symbols, tabs, newlines.
- pattern:\S – all but pattern:\s.
- pattern:\w – Latin letters, digits, underscore '_'.
- pattern:\W – all but pattern:\w.
- pattern:. – any character if with the regexp 's' flag, otherwise any except a newline \n.
…But that’s not all!
Unicode encoding, used by JavaScript for strings, provides many properties for characters, for example: which language a letter belongs to (if it’s a letter), whether it is a punctuation sign, etc.
We can search by these properties as well. That requires flag pattern:u, covered in the next article.
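As a brief preview only, and assuming an engine that supports Unicode property escapes, such a search might look like this (the details belong to the next article):

```javascript
// \w knows only Latin letters, digits and the underscore.
console.log(/\w/.test("я"));      // false

// With the u flag, \p{L} matches a letter of any language...
console.log(/\p{L}/u.test("я"));  // true

// ...and \p{P} matches a punctuation sign.
console.log(/\p{P}/u.test("!"));  // true
```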