How to Efficiently Compare Strings in Go

May 21, 2018 5.9k views
Go

Comparing strings might not be something you think about when optimizing software. Typically optimization includes tasks like splitting loops across goroutines, finding a faster hashing algorithm, or something that sounds more scientific. There is a sense of accomplishment we get when making changes like this. However, string comparison can often times be the top bottleneck in a pipeline. The snippet below is often used, but it is the worst solution (benchmarks below) and it has caused real problems.

strings.ToLower(name) == strings.ToLower(othername)

This appears to be pretty straight forward. Convert each string to lowercase and then compare. To understand why this is a bad solution you have to know what a string represents and how ToLower works. But first, lets talk about the primary use-cases for string comparisons. When comparing using the normal == operator, we get the quickest and most optimized solution. However, APIs and similar software usually take case into consideration. This is when we drop in ToLower and call it feature-complete.

In Go, a string is an immutable sequence of runes. Rune is a term Go uses to represent a Code Point. You can read more about Strings, bytes, runes and characters at the Go Blog. ToLower is a standard library function that loops over each rune in a string, converts it to lowercase, and returns the newly formed string. So the above code traverses each string entirely before the comparison. It is tight bound to the length of the strings. Here is some pseudo code that roughly represents the complexity of the above snippet.

Note: Because strings are immutable, strings.ToLower allocates new memory space for two new strings. This contributes to the time complexity, but this is not the focus now. For brevity, the pseudo code below assumes that strings are mutable.

// Pseudo code
func CompareInsensitive(a, b string) bool {
    // loop over string a and convert every rune to lowercase
    for i := 0; i < len(a); i++ {  a[i] = unicode.ToLower(a[i])  }
    // loop over string b and convert every rune to lowercase
    for i := 0; i < len(b); i++ {  b[i] = unicode.ToLower(b[i])  }
    // loop over both a and b and return false if there is a mismatch
    for i := 0; i < len(a); i++ {
        if a[i] != b[i] { 
            return false
        }
    }
    return true
}

The time complexity is O(n) where n is len(a) + len(b) + len(a). Here's an example:

CompareInsensitive("fizzbuzz", "buzzfizz")

That means we will loop up to 24 times to discover that two completely distinct strings do not match. This is highly inefficient. We could tell these strings were distinct by comparing unicode.ToLower(a[0]) and unicode.ToLower(b[0]) (pseudo code). So lets take that into consideration.

To optimize, We can remove the first two loops in CompareInsensitive and compare each character in each position. If runes don’t match we would then convert the runes to lowercase and then compare again. If they still don’t match then we break the loop and consider the two strings a mismatch. If they match we can continue to the next rune until the end is reached or until a mismatch is found. Lets rewrite this code.

// Pseudo code
func CompareInsensitive(a, b string) bool {
    // a quick optimization. If the two strings have a different
    // length then they certainly are not the same
    if len(a) != len(b) {
        return false
    }

    for i := 0; i < len(a); i++ {
        // if the characters already match then we don't need to 
        // alter their case. We can continue to the next rune
        if a[i] == b[i] { 
            continue
        }
        if unicode.ToLower(a[i]) != unicode.ToLower(b[i]) {
            // the lowercase characters do not match so these
            // are considered a mismatch, break and return false
            return false
        }
    }
    // The string length has been traversed without a mismatch
    // therefore the two match
    return true
}

The new function is much more efficient. The upper bounds is the length of one string rather than the sum of the length of both strings. What does this look like with our comparison above? Well, the loop will only ever run 8 times at most. However, since the first two runes are not the same this loop only runs once. We have optimized our comparison over 20x!

Fortunately, there is a function in the strings package for this. It’s called strings.EqualFold.

Benchmarks

// When both strings are equal
BenchmarkEqualFoldBothEqual-8                   20000000               124 ns/op
BenchmarkToLowerBothEqual-8                     10000000               339 ns/op

// When both strings are equal until the last rune
BenchmarkEqualFoldLastRuneNotEqual-8            20000000               129 ns/op
BenchmarkToLowerLastRuneNotEqual-8              10000000               346 ns/op

// When both strings are distinct
BenchmarkEqualFoldFirstRuneNotEqual-8           300000000             11.2 ns/op
BenchmarkToLowerFirstRuneNotEqual-8             10000000               333 ns/op

// When both strings have a different case at rune 0
BenchmarkEqualFoldFirstRuneDifferentCase-8      20000000               125 ns/op
BenchmarkToLowerFirstRuneDifferentCase-8        10000000               433 ns/op

// When both strings have a different case in the middle
BenchmarkEqualFoldMiddleRuneDifferentCase-8     20000000               123 ns/op
BenchmarkToLowerMiddleRuneDifferentCase-8       10000000               428 ns/op

There is a staggering difference (30x) when the first rune of each string does not match. This is because instead of looping over both strings and then comparing, the loop only runs one time and immediately returns false. In every case, EqualFold outperforms our initial comparison by orders of magnitude.

Does it Matter?

You might be thinking that 400 nanoseconds does not matter. In most cases you might be right. However, some micro-optimizations are as simple as any other solution. In this case it is easier than the original solution. Quality engineers have simple optimizations in their normal work flow. They don't wait to optimize software once it becomes an issue, they write optimized software from the beginning. Even for the best engineers It is unlikely and unrealistic to write the most efficient software from the ground up. It is virtually impossible to think of every edge-case and optimize for it. After all, we rarely know the wild behaviors of our users until we set them loose on our software. However, embedding these simple solutions into your normal work flow will improve the lifespan of applications and prevent unnecessary bottlenecks in the future. Even if that bottleneck never matters, you didn't waste any effort.

3 Answers

Nice one!

Though it should be unicode.ToLower(a[0]), not unicode.ToLower(a[o]).

I think CompareInsensitive should note that it is ignoring non-ASCII characters ("for brevity" I guess), in case someone is tempted to copy it.

Thank you for the article! Here are some nitpicks. Not to discourage you -- we're all at different places of our journey to understanding programming, Go, etc.

Aside from there being a library routine that is probably faster, there is a major bug in this implementation.

If you are going to assume the string is utf8, you can't simply iterate over:

for i := 0; i < len(a); i++ {

What happens if there is a multibyte character? Even when you range over a string, you not only get a copy of the rune but you get the actual byte offset of the rune which will not correspond to the index of the rune if it is a multi byte glyph.

Also, you're not passing unicode.ToLower() a proper rune when you index the string that way -- you're only sending it a byte. So this isn't properly squashing multi byte glyps to lower case.

Take a look through the Go unicode package. There is a function that does what you want just as quickly AND correctly.

  • As mentioned in the notes, none of the code is intended to be accurate Go code. It is pseudo code used for brevity to explain the difference in iterations when using ToLower and EqualFold.

    • Ah -- sorry. Perhaps I should have read the article more carefully. And I meant the strings.EqualFold function.

      Still -- it is difficult to keep in my mind that this is not real code since it is really close to correct and includes benchmarks. One does not typically benchmark pseudo code.

      I'll stop while I've only said a few stupid things. Great article.

      • No worries. I should have made it a little more pseudo so that it was more obvious :P

        • OR... and this is a wild idea: how about just use real code? :P It actually seems more difficult for you as a writer and more confusing to readers to have to preface everything with, "this is not real!"

Have another answer? Share your knowledge.