Monday, July 18, 2016

I'd rather use tokenize() than split(): a subtle difference that can lead to bugs

I was recently fixing a bug in Groovy code that used the String class's split() method to split a string of comma-separated values and then processed each element of the result with .each{}. What bugged me most was that, while I was testing all the possible cases, the test for a blank String ('') failed to produce the expected result. I had to switch to tokenize() to get my blank String ('') test case to pass.

Here is what I noticed and found:
Both split() and tokenize() take a String argument that specifies the delimiter. The subtle difference, as per the API documentation, is the return type: split() returns an array of Strings (String[]), whereas tokenize() returns a List of Strings.
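The two methods also interpret that delimiter argument differently: split() is Java's String.split(), which reads it as a regular expression, while tokenize() treats every character of it as a separate delimiter. A minimal sketch of what that means in practice (the variable names and sample strings are made up for illustration):

```groovy
// Sample inputs, invented for this illustration.
def version = '1.2.3'
def row = 'a, b,c'

// split() reads '.' as the regex wildcard: every character is a delimiter,
// and the trailing empty strings are dropped, so nothing is left.
assert version.split('.').size() == 0

// Escaping the dot gives the intended result.
assert version.split('\\.').toList() == ['1', '2', '3']

// tokenize() takes the dot literally.
assert version.tokenize('.') == ['1', '2', '3']

// With ', ' as the argument, split() needs the exact sequence comma-space,
// whereas tokenize() breaks on a comma OR a space.
assert row.split(', ').toList() == ['a', 'b,c']
assert row.tokenize(', ') == ['a', 'b', 'c']

// Return types: array versus List.
assert row.split(',') instanceof String[]
assert row.tokenize(',') instanceof List
```

So a delimiter that contains regex metacharacters, or more than one character, can give different results depending on which method is used.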

However, the difference for a blank string ('') is:
split() returns an array of size 1, which is unexpected and leads to bugs when looping over the result or relying on its size, whereas tokenize() returns a List of size 0, as expected. The single element in the array returned by split() is nothing but the blank string itself.

Example code:
// non-blank String with comma-separated values
def languagesInMyCareerStr = 'C, C++, Java, Groovy'
def splitLanguages = languagesInMyCareerStr.split(', ')
def tokenizedLanguages = languagesInMyCareerStr.tokenize(', ')
assert ['C', 'C++', 'Java', 'Groovy'] == tokenizedLanguages
assert ['C', 'C++', 'Java', 'Groovy'] == splitLanguages
assert tokenizedLanguages.class == ArrayList
assert splitLanguages.class == String[].class
assert splitLanguages.size() == tokenizedLanguages.size()
assert tokenizedLanguages.size() == 4
assert splitLanguages.size() == 4

// blank String
def languagesBeforeMyCareer = ''
splitLanguages = languagesBeforeMyCareer.split(',')
tokenizedLanguages = languagesBeforeMyCareer.tokenize(',')
assert splitLanguages.size() != tokenizedLanguages.size()
assert tokenizedLanguages.size() == 0
assert splitLanguages.size() == 1
assert splitLanguages[0] == '' // the blank string itself
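To show how this bites inside an .each{} loop, like the one in the code I was fixing, here is a rough sketch. The two closures collectWithSplit and collectWithTokenize are hypothetical stand-ins, not my original code; they only collect the trimmed values they see.

```groovy
// Hypothetical helpers: collect the trimmed values of a comma-separated string.
def collectWithSplit = { String csv ->
    def seen = []
    csv.split(',').each { seen << it.trim() }
    seen
}
def collectWithTokenize = { String csv ->
    def seen = []
    csv.tokenize(',').each { seen << it.trim() }
    seen
}

// Normal input: both behave the same.
assert collectWithSplit('C,Java') == ['C', 'Java']
assert collectWithTokenize('C,Java') == ['C', 'Java']

// Blank input: split() yields one empty element, so the loop body runs once.
assert collectWithSplit('') == ['']
assert collectWithTokenize('') == []

// Empty values in the middle are also kept by split() and skipped by tokenize().
assert collectWithSplit('C,,Java') == ['C', '', 'Java']
assert collectWithTokenize('C,,Java') == ['C', 'Java']

// One possible guard if split() has to stay: drop the empty elements explicitly.
assert ''.split(',').toList().findAll { it } == []
```

If empty values between delimiters are meaningful in your data, split() with an explicit check for the blank string may still be the better fit; otherwise tokenize() avoids the surprise.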