Regular expressions in TEXworks Manual
Regular expressions
As TEXworks is built on Qt4, the regular expressions (often referred to as regexps) it supports are a subset of those provided by Qt4. See the Qt4 documentation for more complete information. More about regexps can be found on the net or in books, but note that not all systems (programming languages, editors,…) use the same set of instructions; there is no “standard set”.
Introduction
When searching and replacing, one has to define the text to be found. This can be the text itself, such as “Abracadabra”, but often it is necessary to define the strings in a more powerful way, to avoid repeating the same operation many times with only small changes from one time to the next. For example, suppose one wants to replace sequences of the letter a by a single o, but only the sequences of 3, 4, 5, 6 or 7 a; this would require repeating the change 5 times. Another example: replacing the vowels by § again takes 5 replace operations.
Here come the regular expressions!
A simple character (a or 9) represents itself. But a set of characters can be defined: [aeiou] will match any vowel, [abcdef] the letters a b c d e f; this last set can be shortened as [a-f] using “-” between the two ends of the range.
To define a set of characters that must not be matched, one uses “^”: the caret negates the character set if it occurs as the first character, i.e. immediately after the opening square bracket. [^abc] matches anything except a b c.
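The sets described above can be tried out in a short sketch. Python's `re` module is used here only as a stand-in: it is not the Qt4 engine behind TEXworks, but it is assumed to treat these simple sets in the same way. The sample strings are invented for illustration.

```python
import re

# [aeiou] matches any single lowercase vowel; each one is replaced by "§".
vowels_replaced = re.sub(r"[aeiou]", "§", "abracadabra")

# [a-f] is shorthand for [abcdef].
in_range = re.findall(r"[a-f]", "big cafe")

# [^abc] matches any character except a, b or c.
not_abc = re.findall(r"[^abc]", "abcxyz")

print(vowels_replaced)  # §br§c§d§br§
print(in_range)         # ['b', 'c', 'a', 'f', 'e']
print(not_abc)          # ['x', 'y', 'z']
```

A single replace with [aeiou] does the work of the five separate replace operations mentioned in the introduction.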
Codes to represent special sets
When using regexps, one very often has to write patterns that stand for a whole family of strings. For example, if you are looking for an email address, the letters and symbols will vary, yet you can still search for any string that has the shape of an email address (roughly text@text.text). So there are abbreviations to represent letters, figures, symbols,…
These codes replace and facilitate the definition of sets; for example to mean the set of digits [0-9], one can use “\d”. The following table lists the replacement codes.
Element Meaning
c Any character represents itself unless it has a special regexp meaning. Thus c matches the character c.
\c A character that follows a backslash matches the character itself except where mentioned below. For example if you wished to match a literal caret at the beginning of a string you would write “\^”.
\n This matches the ASCII line feed character (LF, Unix newline, used in TEXworks).
\r This matches the ASCII carriage return character (CR).
\t This matches the ASCII horizontal tab character (HT).
\v This matches the ASCII vertical tab character (VT).
\xhhhh This matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF).
\0ooo (i.e., zero-ooo) matches the ASCII/Latin-1 character corresponding to the octal number ooo (between 0 and 0377).
. (dot) This matches any character (including newline). So if you want to match a literal dot, you have to escape it: “\.”.
\d This matches a digit.
\D This matches a non-digit.
\s This matches a white space.
\S This matches a non-white space.
\w This matches a word character (a letter, a number or “_”).
\W This matches a non-word character.
\n Here n stands for a number: the n-th back-reference, e.g. \1, \2, etc.
Using these abbreviations is better than describing the set, because the abbreviations remain valid in different alphabets.
Note that the end of line is often treated as white space. In TEXworks the end of line is referred to by “\n”.
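A few of the codes from the table can be illustrated with a minimal sketch. It uses Python's `re` module as a stand-in, which is an assumption: it is not the Qt4 engine, and it differs in some details (for instance, in Python the dot does not match a newline by default). The examples avoid newlines, so they should behave as the table describes. The sample strings are invented.

```python
import re

# \d matches a digit, \w a word character (including "_"),
# \S any character that is not white space.
digits = re.findall(r"\d", "a1b22")
word_chars = re.findall(r"\w", "a_b-c")
non_spaces = re.findall(r"\S", "a b\tc")

# An unescaped dot matches any character; "\." matches only a literal dot.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a\.c", "abc a.c")

# \1 refers back to whatever the first parenthesised group matched,
# so (\w)\1 finds a doubled character.
doubled = re.search(r"(\w)\1", "letter").group()

print(digits)       # ['1', '2', '2']
print(word_chars)   # ['a', '_', 'b', 'c']
print(non_spaces)   # ['a', 'b', 'c']
print(any_char)     # ['abc', 'a.c']
print(literal_dot)  # ['a.c']
print(doubled)      # tt
```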
Repetition
One rarely works on a single letter, digit or symbol; most of the time these are repeated (e.g. a number is a repetition of digits and symbols, in the right order).
To specify the number of repetitions, one uses a so-called “quantifier”: a{1,1} means exactly one a, a{3,7} means between 3 and 7 a; {1,1} can be dropped, so a{1,1} = a.
This can be combined with the set notation: [0-9]{1,2} matches at least one digit and at most two, i.e. the integers between 0 and 99. But this will match any group of 1 or 2 figures within a string; if we want it to match the whole string (the string contains only 1 or 2 figures), we write the regular expression as ^[0-9]{1,2}$. Here ^ says that the match must start at the beginning of the string and $ that it must end at the end of the string, so the string contains only one or two figures (^ and $ are “assertions”; see later for more).
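The effect of a quantifier, with and without the ^ and $ assertions, can be checked with a small sketch. Python's `re` module is assumed here as a stand-in for the Qt4 engine, and the sample strings are invented for illustration.

```python
import re

# a{3,7} matches runs of 3 to 7 "a"; the run of only two is left alone.
replaced = re.sub(r"a{3,7}", "o", "baaad aa aaaaa")

# Without assertions, [0-9]{1,2} finds groups of one or two digits anywhere.
groups = re.findall(r"[0-9]{1,2}", "room 123")

# With ^ and $, the whole string must consist of one or two digits.
candidates = ["7", "42", "123", "4a"]
whole = [s for s in candidates if re.search(r"^[0-9]{1,2}$", s)]

print(replaced)  # bod aa o
print(groups)    # ['12', '3']
print(whole)     # ['7', '42']
```

The first call performs in one step the replacement that the introduction said would otherwise take five separate operations.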
Here is a table of quantifiers. E represents an expression (letter, abbreviation, set).
E? Matches zero or one occurrence of E. This quantifier means the previous expression is optional. It is the same as E{0,1}.
E+ Matches one or more occurrences of E. This is the same as E{1,MAXINT}.
E* Matches zero or more occurrences of E. This is the same as E{0,MAXINT}. The * quantifier is often used by mistake for the + quantifier. Since it matches zero or more occurrences, it also matches where there is no occurrence at all.
E{n} Matches exactly n occurrences of the expression. This is the same as repeating the expression n times.
E{n,} Matches at least n occurrences of the expression. This is the same as E{n,MAXINT}.
E{,m} Matches at most m occurrences of the expression. This is the same as E{0,m}.
E{n,m} Matches at least n and at most m occurrences of the expression.
MAXINT Depends on the implementation; its minimum value is 1024.
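The quantifiers in the table can be compared side by side with a short sketch. It assumes Python's `re` module as a stand-in for the Qt4 engine; the helper `full_matches` and the sample list are hypothetical names introduced only for this illustration.

```python
import re

samples = ["", "a", "aa", "aaa"]


def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]


optional = full_matches(r"a?")        # same as a{0,1}
one_or_more = full_matches(r"a+")
zero_or_more = full_matches(r"a*")
exactly_two = full_matches(r"a{2}")
at_least_two = full_matches(r"a{2,}")
at_most_two = full_matches(r"a{,2}")  # same as a{0,2}
one_to_two = full_matches(r"a{1,2}")

# The common mistake: \d* also succeeds on text without any digit,
# because it is allowed to match zero occurrences (an empty match).
star_match = re.search(r"\d*", "abc")
plus_match = re.search(r"\d+", "abc")

print(optional, one_or_more, zero_or_more)
print(exactly_two, at_least_two, at_most_two, one_to_two)
print(repr(star_match.group()), plus_match)  # '' None
```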
Alternatives and assertions
When searching, it is often necessary to search for alternatives, e.g. apple, pear, cherry, but not pineapple. To separate the alternatives, one uses |: apple|pear|cherry. But this will not prevent pineapple from being found, so we have to specify that apple should be standalone, a whole word (as it is often called in search dialog boxes).
To specify that a string should be considered standalone, we specify that it is surrounded by word separators/boundaries (begin/end of sentence, space), like \bapple\b. For our alternatives example we group them in parentheses and add the boundaries: \b(apple|pear|cherry)\b. Apart from \b, we have already seen ^ and $.
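The difference that the word boundaries make can be seen in a minimal sketch. Python's `re` module is assumed as a stand-in for the Qt4 engine, and the sample sentence is invented.

```python
import re

text = "apple pie, pineapple juice, pear tart, cherry jam"

# Plain alternatives also find the "apple" hidden inside "pineapple".
plain = re.findall(r"apple|pear|cherry", text)

# With \b on both sides, only whole words are found.
whole_words = re.findall(r"\b(apple|pear|cherry)\b", text)

print(plain)        # ['apple', 'apple', 'pear', 'cherry']
print(whole_words)  # ['apple', 'pear', 'cherry']
```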
Here is a table of the “assertions”, which do not correspond to characters and will never be part of the result of a search.
^ The caret signifies the beginning of the string. If you wish to match a literal ^ you must escape it by writing \^.
$ The dollar signifies the end of the string. If you wish to match a literal $ you must escape it by writing \$.
\b A word boundary.
\B A non-word boundary. This assertion is true wherever \b is false.
(?=E) Positive lookahead. This assertion is true if the expression matches at this point in the regexp.
(?!E) Negative lookahead. This assertion is true if the expression does not match at this point in the regexp.
Notice the different meanings of ^ as assertion and as negation inside a set!
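The two lookahead assertions can be illustrated with a short sketch. Python's `re` module is assumed as a stand-in for the Qt4 engine, and the sample string is invented. Since an assertion is never part of the result, only "tex" is replaced; the text checked by the lookahead stays untouched.

```python
import re

text = "texworks texlive"

# tex(?=works): "tex" only where it is followed by "works".
positive = re.sub(r"tex(?=works)", "TeX", text)

# tex(?!works): "tex" only where it is NOT followed by "works".
negative = re.sub(r"tex(?!works)", "TeX", text)

print(positive)  # TeXworks texlive
print(negative)  # texworks TeXlive
```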
Final notes
Using regexps is very powerful, but also very dangerous; you could change your text in places you did not notice, and sometimes reverting to the previous situation is not fully possible. If you see the error immediately, you can use Ctrl+Z.
Showing how to exploit the full power of regexps would require much more than this extremely short summary; in fact it would require a full manual of its own.
Also note that there are some limits in the implementation of regexps in TEXworks; in particular, the assertions (^ and $) only consider the whole file.
Finally, do not forget to “tick” the regexp option when using them in the Find and Replace dialogs and to un-tick the option when not using regexps.
(Excerpt from TeXWorks Manual by Alain Delmotte)