Regular expressions are extraordinarily powerful and complex. Recently I needed to split lines into whitespace-separated segments (words) while retaining all the whitespace.
Above is my final bit of code, which worked correctly. To understand the magic, first we must acknowledge that I couldn’t match any of the whitespace. The standard operating procedure for regular expressions is to “select” the matched text, and splitting on that text means losing it.
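To make the problem concrete, here is a small Java sketch of the naive approach. The class and method names are made up for illustration; the point is only that splitting on the whitespace itself throws the whitespace away.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original code.
class NaiveSplitDemo {
    // Splits on runs of whitespace. The matched whitespace is consumed
    // by the split and does not appear in any of the returned segments.
    static String[] naiveSplit(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        // Two spaces between each word.
        String[] words = naiveSplit("this  is  a  test");
        System.out.println(Arrays.toString(words)); // [this, is, a, test]
    }
}
```

Joining those four words back together cannot reproduce the original line, because the spacing is gone.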
The solution is a concept called lookaround (in the above case, lookbehind rather than lookahead), which doesn’t select the matching text. Lookbehind is defined by ?< (as opposed to just ? for lookahead), and the = means I am searching for a match (as opposed to !, which means I am looking for text that doesn’t match). The text I am looking for is defined by \S, which is a non-whitespace character (equivalent to [^\s], where ^ means “isn’t”), and \\s, which refers to a whitespace character (the double backslash is due to Java string syntax). And because I am using lookbehind, I am actually searching for a non-whitespace character followed by a whitespace character immediately before the current position.
Ok, so at every position in the line I’m splitting, the engine looks back at the two preceding characters for a non-whitespace character followed by a whitespace character. If it finds them, I have my match, and because a lookbehind consumes no text, the line is split at that position without losing anything. Each segment therefore holds whatever whitespace is left over from the previous gap, the word, and the single whitespace character that follows the word.
In summary, if my line was “this  is  a  test” (two spaces between each word), I would end up with an array which read ["this ", " is ", " a ", " test"]. And while this obviously isn’t perfect, it does serve my purposes perfectly.
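As a rough sketch of the idea described here, the following Java snippet splits on a lookbehind. The exact pattern is an assumption reconstructed from the description (a non-whitespace character followed by a whitespace character, inside a positive lookbehind), and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo class; the pattern is an assumption based on the
// description in the text, not a copy of the original code.
class WhitespaceSplit {
    // (?<=\S\s) is a zero-width positive lookbehind: it matches the position
    // right after a non-whitespace character followed by one whitespace
    // character. Nothing is consumed, so no text is lost in the split.
    static String[] splitKeepingWhitespace(String line) {
        return line.split("(?<=\\S\\s)");
    }

    public static void main(String[] args) {
        // Two spaces between each word.
        String line = "this  is  a  test";
        String[] parts = splitKeepingWhitespace(line);
        // Four segments: "this ", " is ", " a ", " test"
        for (String part : parts) {
            System.out.println("\"" + part + "\"");
        }
        // Concatenating the segments gives back the original line.
        System.out.println(String.join("", parts).equals(line)); // true
    }
}
```

Because every character ends up in exactly one segment, joining the segments restores the line unchanged, which is the property the plain whitespace split lacks.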
(For reasons that remain unclear to me, the parentheses surrounding the expression are part of the lookaround syntax and are not capturing groups, so they create no backreference in the normal sense.)
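One way to check this, sketched here with a hypothetical helper class, is to ask Java how many capturing groups a pattern contains. The lookbehind pattern used below is the same assumed pattern, a non-whitespace character followed by a whitespace character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for counting capturing groups.
class LookaroundGroups {
    // Returns the number of capturing groups in the given regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Lookbehind parentheses do not capture anything.
        System.out.println(groupCount("(?<=\\S\\s)")); // 0
        // Plain parentheses around the same text form one capturing group.
        System.out.println(groupCount("(\\S\\s)"));    // 1
    }
}
```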
If you want to know more, check out the heroic attempts of regular-expressions.info to clarify and simplify the nightmare and glory that are regular expressions.