Split Words + Spaces with Java + Regex

2010, Dec 10, 05:12 am
Java, Regex

Wherein I elucidate the regular expression necessary for splitting lines into whitespace-separated segments while retaining all whitespace.

Regular expressions are extraordinarily powerful and complex. Recently I needed to split lines into whitespace-separated segments (words) while retaining all the whitespace.


Above is my final bit of code, which worked correctly. To understand the magic, first we must acknowledge that I couldn’t let the pattern match any of the whitespace. The standard operating procedure for regular expressions is to “select” (consume) the matched text, and splitting on that text means losing it.

The solution is a concept called lookaround (in the above case, lookbehind rather than lookahead), which checks for text next to the current position without selecting it. A lookbehind is written (?<=...) and a lookahead is written (?=...). The = means I am searching for a match; a ! in its place, as in (?<!...) or (?!...), means I am looking for text that doesn’t match.
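To make the difference concrete, here is a minimal sketch comparing an ordinary split with a lookbehind split. The class and method names are made up for illustration, and the patterns are deliberately simpler than the one discussed in this post.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Splits on each whitespace character. The matched character is
    // consumed by the split, so it disappears from the result.
    static String[] splitConsuming(String line) {
        return line.split("\\s");
    }

    // Splits at every position that is preceded by a whitespace character.
    // The lookbehind is zero-width: nothing is consumed, nothing is lost.
    static String[] splitAfterWhitespace(String line) {
        return line.split("(?<=\\s)");
    }

    public static void main(String[] args) {
        String line = "a b c";
        System.out.println(Arrays.toString(splitConsuming(line)));
        System.out.println(Arrays.toString(splitAfterWhitespace(line)));
    }
}
```

For the input "a b c", the first call yields the three segments "a", "b", "c", while the second yields "a ", "b ", "c", so joining the second result gives back the original line.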

The text I am looking for is defined by \S, which is a non-whitespace character (equivalent to [^\s], where ^ inside a character class means “isn’t”), and \s, which is a whitespace character (the doubled backslash in the code is due to Java string syntax). And because I am using lookbehind, I am actually searching for a non-whitespace character followed by a whitespace character.
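As a quick check of those character classes, the sketch below (hypothetical class and constant names) shows that \S and [^\s] accept and reject the same single characters, and that the doubled backslash exists only in the Java source, not in the pattern itself.

```java
import java.util.regex.Pattern;

class WhitespaceClassDemo {
    // \S as a Java string literal: the backslash has to be escaped.
    static final String NON_WHITESPACE = "\\S";
    // The equivalent negated character class.
    static final String NEGATED_CLASS = "[^\\s]";

    // True when the whole of text matches regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        for (String sample : new String[] {"x", " ", "\t"}) {
            System.out.println(matches(NON_WHITESPACE, sample)
                    + " " + matches(NEGATED_CLASS, sample));
        }
        // The regex engine sees two characters: a backslash and an S.
        System.out.println(NON_WHITESPACE.length());
    }
}
```

Both patterns match "x" and reject a space or a tab, and the length printed at the end is 2.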

OK, so at every single position in the line I’m splitting, I look back at the previous character for a non-whitespace character, and if I find one I look back to see whether the character behind that one is a whitespace character. If it is, I have my match and the line is split at the start of the match. Each segment is therefore one whitespace character, the word, and all the following whitespace characters up to the one before the next word.

In summary, if my line was “this    is    a    test”, I would end up with an array that reads ["this   ", " is   ", " a   ", " test"]. And while this obviously isn’t perfect, it does serve my purposes perfectly.
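For anyone who wants to experiment, here is a rough sketch of two zero-width splits built from \S and \s. The exact expression from my snippet is not reproduced in this text, so both patterns, along with the class and method names, are assumptions for illustration only. Java checks a lookbehind against the characters immediately before a position and a lookahead against the characters immediately after it, so the two forms cut the line in different places.

```java
import java.util.Arrays;

class SplitKeepWhitespace {
    // Assumed pattern: split at each position preceded by a
    // non-whitespace character and then one whitespace character.
    static String[] splitWithLookbehind(String line) {
        return line.split("(?<=\\S\\s)");
    }

    // Assumed pattern: split at each position followed by one
    // whitespace character and then a non-whitespace character.
    static String[] splitWithLookahead(String line) {
        return line.split("(?=\\s\\S)");
    }

    public static void main(String[] args) {
        String line = "this    is    a    test"; // four spaces between words
        for (String segment : splitWithLookbehind(line)) {
            System.out.println("[" + segment + "]");
        }
        for (String segment : splitWithLookahead(line)) {
            System.out.println("[" + segment + "]");
        }
        // Either way, no character is lost.
        System.out.println(String.join("", splitWithLookahead(line)).equals(line));
    }
}
```

With four spaces between the words, the lookbehind form gives ["this ", "   is ", "   a ", "   test"], and the lookahead form gives ["this   ", " is   ", " a   ", " test"], which is the array quoted above. In both cases, joining the segments restores the original line.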

(For reasons that remain unclear to me, the parentheses surrounding the expression are part of the lookaround syntax and are not a capturing group, so they don’t create a backreference in the normal sense.)
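One way to see this is to ask Java how many capturing groups a pattern contains. The sketch below uses a hypothetical helper and illustrative patterns; it relies only on Pattern.compile and Matcher.groupCount.

```java
import java.util.regex.Pattern;

class LookaroundGroupDemo {
    // Returns the number of capturing groups declared in the regex.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Lookaround parentheses do not capture anything.
        System.out.println(capturingGroups("(?<=\\S\\s)"));
        // Plain parentheses create one group each.
        System.out.println(capturingGroups("(\\S)(\\s)"));
    }
}
```

The first line printed is 0 and the second is 2, so there is nothing for \1 to refer back to in a pattern made up of a lookaround alone.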

If you want to know more, check out the heroic attempts of regular-expressions.info to clarify and simplify the nightmare and glory that are regular expressions.

